The Hitchhiker’s Guide to Production Outage Triage with Amazon Bedrock

During a production outage, Amazon Bedrock functions as a structured reasoning layer: it receives CloudWatch logs, metrics, and trace data already captured in your AWS account, then applies probabilistic classification to surface the most likely root cause. The aim is to shorten mean time to diagnosis without replacing the human engineer who makes the final remediation decision.

Article Summary

  1. What it is: Amazon Bedrock outage triage shows how to run a readonly, evidence bounded reasoning layer inside your own AWS account that returns ranked hypotheses with confidence scores during a live production incident.
  2. Why it matters: The article argues this three phase progressive deepening model addresses the three specific reasons outage investigations fail: cognitive load on on call engineers, incomplete coverage of distributed surface areas, and false coherence from incomplete evidence hardening into wrong conclusions.
  3. Key takeaway: Bedrock never touches your AWS account directly and is structurally prevented from asserting causality without explicit supporting evidence in the collected text, making UNKNOWN a required output when evidence is absent rather than an optional fallback.
~20 min read

Andrew Baker — andrewbaker.ninja — 13 June 2026

How to use a large language model inside your own AWS account to interrogate your infrastructure while it is on fire

So your production environment is throwing errors at 2 AM, your on-call engineer is staring at a wall of CloudWatch noise, and someone in the incident channel has already asked “has anyone checked the database?” for the fourth time. You will survive the outage. The more useful question is how long it takes you to find the specific thing that is broken, and whether you have the tooling to make that materially faster next time.

This guide is about using Amazon Bedrock as a structured reasoning layer over captured infrastructure evidence during a live production incident. Not as an autonomous operator, not as a magic oracle, but as a probabilistic classification engine that receives raw AWS data you have already collected and returns ranked hypotheses with confidence scores, supporting evidence, contradicting evidence, and the next best query to run. Every action it recommends passes through a human before anything in the environment changes.

The entire workflow runs from a readonly IAM role. Bedrock never touches your AWS account directly.

Table of Contents

  1. Architecture and Operating Model
  2. How This Differs from Existing Products
  3. Setting Up the Readonly IAM Role
  4. The Bedrock Prompt Engine and Evidence Contract
  5. Network Triage
  6. Kubernetes and Container Diagnostics
  7. Database Diagnostics
  8. S3 Diagnostics
  9. Cache Diagnostics: ElastiCache and DAX
  10. Security and Compliance Sweep
  11. Auth, Identity, and Certificate Diagnostics
  12. CI/CD and Release Causality
  13. Appendix A: Bedrock Quota Management
  14. Appendix B: Baseline Snapshot
  15. Disk and Storage Diagnostics

1. Architecture and Operating Model

Before any script is discussed, the architecture needs to be understood clearly, because it is the reason this approach is defensible in a production environment and the reason it produces useful output instead of confident hallucination.

The pipeline is:

Evidence collection -> Blast radius estimate -> Evidence prioritisation -> Reasoning -> Human decision -> Execution

Each stage is strictly separated. The collection scripts call the AWS CLI in readonly mode, write everything to disk locally, and produce no side effects on the environment. The reasoning stage sends that locally stored text to Bedrock and receives structured hypotheses back, which a human then reviews before deciding what action to take. Nothing is executed automatically, no alert is suppressed, no scaling action is taken, and no security group is touched. The separation is not a convenience: it is the architectural property that makes this safe to run during a live incident.

Bedrock operates under a strict evidence contract enforced through the system prompt. It may only assert findings directly supported by text present in the evidence it received. If evidence is absent or ambiguous, it must return UNKNOWN rather than infer. It must provide a confidence score between 0 and 1 for each hypothesis, list the evidence that supports the hypothesis, list the evidence that contradicts or weakens it, and name the single next data point that would most increase confidence. This constraint is what separates structured diagnostic reasoning from plausible narrative generation.
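
The contract is enforced by the system prompt, but the response can also be checked mechanically before anyone reads it. The following is a minimal sketch of such a check, assuming the field names used in the JSON examples in this guide; the function name and error strings are hypothetical and not part of the guide's scripts.

```python
# Minimal validator for one hypothesis object returned by the model.
# Field names follow the JSON examples in this guide; the function name
# and the exact error strings are illustrative assumptions.

REQUIRED_FIELDS = (
    "hypothesis",
    "confidence",
    "supporting_evidence",
    "contradicting_evidence",
    "next_best_query",
)


def contract_violations(hyp):
    """Return a list of evidence-contract violations (empty list means pass)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in hyp:
            problems.append(f"missing field: {field}")

    conf = hyp.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number between 0 and 1")

    # A concrete claim needs at least one cited piece of evidence;
    # an explicit UNKNOWN is allowed to cite none.
    is_unknown = str(hyp.get("hypothesis", "")).strip().upper() == "UNKNOWN"
    if not is_unknown and not hyp.get("supporting_evidence"):
        problems.append("non-UNKNOWN hypothesis has no supporting evidence")

    if not str(hyp.get("next_best_query", "")).strip():
        problems.append("next_best_query is empty")
    return problems
```

A response that fails this check is better discarded and re-requested than read, since a hypothesis without cited evidence is exactly the plausible narrative the contract exists to exclude.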

1.1 Why outage triage fails without this

Most outage investigations fail to find root cause quickly for one of three reasons.

The first is cognitive load. An engineer managing a 2 AM incident is simultaneously handling the Slack channel, reading dashboards, responding to stakeholders, and trying to maintain a mental model of a distributed system they may not have touched in months. The pattern-matching capacity that makes senior engineers valuable degrades rapidly under this load.

The second is coverage. The evidence that identifies root cause is often in a service no one thought to check. It is in the flow logs no one looked at, the OpenSearch JVM heap metric that was never alarmed, the Aurora query plan that changed silently after a statistics update. No single engineer holds the full surface area of a production AWS account in their head simultaneously.

The third is false coherence. Incomplete evidence allows plausible but wrong narratives to form and harden into working hypotheses that waste investigation time. An engineer who concludes the database is slow because the CPU is high has constructed a coherent story that may have nothing to do with the actual cause.

Bedrock addresses all three by operating without cognitive load, covering all collected surface areas simultaneously, and being structurally prevented from asserting causality without timestamps and explicit supporting evidence.

1.2 What Bedrock must never do

This section is not optional reading.

Including it in the architecture description rather than as an afterthought is deliberate. Any production use of AI-assisted tooling that does not begin by defining the hard exclusions has not finished its architecture.

Bedrock must never, under any circumstances, take or initiate any of the following actions:

  • Restart workloads, instances, pods, or tasks
  • Modify autoscaling policies or trigger scale-in or scale-out events
  • Alter route tables, security groups, or network ACLs (NACLs)
  • Open any inbound rule in any security group
  • Modify DNS records or resolver rules
  • Suppress, silence, or acknowledge alerts or alarms
  • Create, modify, or close incident tickets automatically
  • Write remediation commands without explicit human review
  • Execute any AWS CLI command other than readonly calls
  • Make assumptions about what a human intended and act on them

The bedrock-ask.sh script in this guide calls the Bedrock runtime InvokeModel API only (IAM action bedrock:InvokeModel). It has no execution capability. The IAM role it uses is bounded to read-style actions: Describe*, Get*, and List* across the services it covers, plus a small number of query actions such as logs:FilterLogEvents and logs:StartQuery. If you extend this guide, do not grant the diagnostic role write permissions. If a vendor tool or automation pipeline proposes granting Bedrock write access to investigate or remediate an incident, that proposal should be declined.
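
One way to keep an extended policy honest is to lint it before attaching it. The sketch below flags any allowed action whose name does not look read-style. The prefix allowlist and the function name are assumptions made for illustration, not an authoritative classification of AWS actions, so treat a clean result as a first filter rather than proof.

```python
# Action-name prefixes treated as read-style. This allowlist is an
# assumption for illustration, not an authoritative AWS classification.
READ_PREFIXES = ("Describe", "Get", "List", "BatchGet", "Lookup", "Filter", "Download")

# Actions the guide's policy grants deliberately even though their names
# do not match a read prefix (Logs Insights queries, API Gateway reads,
# Bedrock inference).
EXPLICIT_ALLOW = {
    "logs:StartQuery",
    "logs:StopQuery",
    "apigateway:GET",
    "bedrock:InvokeModel",
    "bedrock:InvokeModelWithResponseStream",
}


def non_readonly_actions(policy):
    """Return allowed actions in an IAM policy document that do not look read-only."""
    flagged = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # IAM permits a single action string
            actions = [actions]
        for action in actions:
            if action in EXPLICIT_ALLOW:
                continue
            # "ec2:Describe*" -> "Describe*"; a bare "*" yields an empty name
            _, _, name = action.partition(":")
            if not name.startswith(READ_PREFIXES):
                flagged.append(action)
    return flagged
```

Load the policy document with json.load and fail the change if the returned list is non-empty. Service-wide wildcards such as s3:* are flagged because they cannot be shown to be read-only from the name alone.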

1.3 The progressive deepening model

The scripts in this guide can be run in sequence to achieve progressively deeper diagnosis. The first pass covers all active service surface areas and identifies anomalies. The second pass receives the first pass findings as context and investigates each anomaly in depth, forming hypotheses with confidence scores. The third pass applies 5-Whys reasoning to every unresolved issue and either confirms root cause or explicitly names the single remaining data point needed to do so.

Each phase has an operational discipline target, not just a mechanical description:

| Phase | Goal | Time target | Success criteria |
|---|---|---|---|
| Detect | Find all abnormal systems | Under 5 minutes | Incident surface isolated, no false negatives |
| Narrow | Reduce candidate causes | Under 10 minutes | Three or fewer likely hypotheses remaining |
| Confirm | Collect disconfirming evidence | Under 15 minutes | Single RCA confidence above 80% |

The time targets are achievable because evidence collection happens in parallel across all services and Bedrock processes the full surface area in a single reasoning pass rather than one service at a time.

1.4 Evidence quality and source trust

Not all evidence is equally trustworthy, and treating it as though it were is one of the most reliable paths to a wrong RCA.

| Source | Trust | Reason |
|---|---|---|
| CloudTrail | Very high | Cryptographically signed, sequential, non-repudiable |
| VPC Flow Logs | High | Kernel-level capture, though delayed by capture interval |
| ALB/NLB access logs | High | Low-level, complete per-request records |
| CloudWatch metrics | Medium | Aggregated over period boundaries, can mask spikes |
| Application logs | Medium | Application-controlled, may be missing under pressure |
| Human incident notes | Low | Subject to recency bias and post-hoc rationalisation |
| Prior Bedrock output | Very low | Contains inference, must not be treated as ground truth |

The structured output includes an evidence_quality block nested inside each piece of cited evidence:

{
  "evidence_quality": {
    "source": "cloudwatch_metrics",
    "trust": 0.72,
    "completeness": 0.85,
    "time_skew_seconds": 180,
    "notes": "5-minute aggregation period; spike within window may be masked"
  }
}

1.5 Confidence and evidence scoring

Every Bedrock response decomposes confidence into five components that an engineer can interrogate individually:

{
  "hypothesis": "CoreDNS restarting because Corefile forward directive points at unreachable upstream",
  "confidence": 0.74,
  "confidence_components": {
    "coverage": 0.68,
    "counterevidence": 0.82,
    "freshness": 0.95,
    "novelty": 0.31,
    "trust": 0.79
  },
  "supporting_evidence": [
    "CoreDNS pod restart count: 8 in last hour",
    "SERVFAIL responses in CoreDNS logs correlated with restart events",
    "Corefile forward directive: 10.100.0.2 (DHCP options changed 6h ago)"
  ],
  "contradicting_evidence": [
    "CoreDNS CPU is only 12% suggesting not overwhelmed by query volume",
    "kube-dns endpoints show 2 healthy pods registered"
  ],
  "next_best_query": "Compare current VPC DHCP options nameserver against the hardcoded IP in the CoreDNS Corefile to confirm mismatch"
}

The five components are:

  • coverage: how much of the expected evidence surface was actually collected.
  • counterevidence: how weakly the contradictions weigh against the hypothesis (a high score means the contradictions are weak).
  • freshness: whether the evidence is recent enough to be reliable.
  • novelty: how unexpected this hypothesis would be given the environment’s history (a low score means a common failure mode).
  • trust: the weighted source trust across all cited evidence.

A hypothesis with high overall confidence but low coverage is a gap risk: the model is confident on incomplete information and a single new data point could collapse the score. A hypothesis with high overall confidence but low trust is a source quality risk.
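
These two risk patterns are easy to flag automatically. The sketch below is one way to do it, assuming the confidence_components layout shown above; the function name and both thresholds (0.7 for "high" confidence, 0.5 for a "low" component) are hypothetical values chosen for illustration, not figures from this guide.

```python
def confidence_risks(hyp, high_confidence=0.7, low_component=0.5):
    """Flag hypotheses whose headline confidence rests on weak components.

    Thresholds are illustrative assumptions; tune them to your environment.
    A missing component is treated as 0.0, so it is flagged rather than ignored.
    """
    components = hyp.get("confidence_components", {})
    risks = []
    if hyp.get("confidence", 0.0) >= high_confidence:
        if components.get("coverage", 0.0) < low_component:
            # Confident on incomplete information: one new data point
            # could collapse the score.
            risks.append("gap_risk")
        if components.get("trust", 0.0) < low_component:
            # Confident on low-trust sources such as notes or prior LLM output.
            risks.append("source_quality_risk")
    return risks
```

Applied to the CoreDNS example above (confidence 0.74, coverage 0.68, trust 0.79), this returns an empty list under the assumed thresholds.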

1.6 Counterfactual triage

Standard hypothesis testing asks: does the evidence support this explanation? Counterfactual triage asks a harder question: if this hypothesis were true, what additional signals would we expect to observe, and are they present?

{
  "hypothesis": "RDS connection pool exhaustion caused by lock contention on orders table",
  "confidence": 0.71,
  "counterfactual": {
    "if_true_would_expect": [
      "DatabaseConnections at or above max_connections parameter value",
      "Performance Insights showing io/lock_wait as top wait event",
      "Application logs showing connection acquisition timeout errors"
    ],
    "observed_present": [
      "DatabaseConnections at or above max_connections parameter value"
    ],
    "observed_absent": [
      "Performance Insights showing io/lock_wait - PI shows CPU as top wait event instead",
      "Connection timeout errors not present - errors show query timeout not connection timeout"
    ],
    "verdict": "WEAKENED - two of three predicted co-signals absent. CPU as top wait event suggests compute saturation not lock contention. Reduce confidence.",
    "revised_confidence": 0.41
  }
}

When a hypothesis has a verdict of DISCARD, it moves to the discard_these array. Counterfactual triage tends to be faster than confirmatory triage because an elimination removes a candidate outright, whereas a confirmation still leaves the competing hypotheses to be ruled out.
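
The bookkeeping behind a verdict can be sketched in a few lines. This is a simplified illustration, not the rule Bedrock applies: it assumes signals are matched by exact string, and it assumes a hypothetical mapping in which all predicted signals present means CONFIRMED, none present means DISCARD, and anything in between means WEAKENED.

```python
def counterfactual_verdict(if_true_would_expect, observed):
    """Classify a hypothesis by how many of its predicted co-signals were observed.

    The three-way mapping below is an assumption for illustration; the
    guide's system prompt leaves the judgement to the model.
    """
    expected = list(if_true_would_expect)
    if not expected:
        raise ValueError("a hypothesis must predict at least one signal")
    seen = set(observed)
    present = [signal for signal in expected if signal in seen]
    absent = [signal for signal in expected if signal not in seen]
    if not absent:
        verdict = "CONFIRMED"
    elif not present:
        verdict = "DISCARD"
    else:
        verdict = "WEAKENED"
    return {
        "observed_present": present,
        "observed_absent": absent,
        "verdict": verdict,
    }
```

With the three predicted signals from the RDS example and only the DatabaseConnections signal observed, this returns WEAKENED with two absent signals, matching the shape of the output above. Revising the confidence score is left out because the guide does not specify a formula for it.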

1.7 Incident state machine

| State | Primary question | Bedrock output |
|---|---|---|
| Detect | Is there actually an incident? | Anomaly confirmation, blast radius estimate |
| Contain | Is the blast radius stable or growing? | Safe immediate mitigations, rollback candidates |
| Recover | What is the fastest path to degraded-but-stable? | Ordered recovery actions, validation criteria |
| Validate | Is the recovery holding? | Metric deltas versus pre-incident baseline |
| RCA | What was the root cause? | Confirmed causal chain, counterfactual verdicts |
| Prevention | What control change prevents recurrence? | Specific config, alarm, and process recommendations |

The validation stage is absent from most incident tooling. Bedrock’s validation output gives specific criteria: which metrics to watch, what values they should reach, over what time window, and what would constitute a reversion requiring rollback.

1.8 Blast radius drives evidence prioritisation

Blast radius is not an output you read at the end of the investigation. It is an input that shapes which evidence you collect first. The runbook performs a fast blast radius estimate as its first Bedrock call, using only the ALB health and CloudWatch alarm data that can be collected in under two minutes. That estimate gates evidence prioritisation: it tells you which service families to investigate deeply and which to defer.

1.9 Causal graph construction

The system prompt instructs Bedrock to construct an intermediate causal graph before forming hypotheses:

Aurora latency spike (symptom)
  -> caused by
Connection pool exhaustion on application tier (intermediate cause)
  -> caused by
Long-running queries blocking connection release (proximate cause)
  -> caused by
Missing index on orders table after migration (root cause)
  -> blast radius
ALB 503s because application cannot acquire DB connections (user impact)

Without this intermediate layer, Bedrock might correlate Aurora latency with ALB 503s and recommend scaling the database, which treats the symptom rather than the cause.
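
A chain like this can be held as plain data so that the least-supported link is easy to locate. The sketch below is an illustration only: the per-link confidence values are invented for the example, and the function name is hypothetical. The two output keys mirror the weakest_link_confidence and weakest_link_description fields in the output schema of bedrock-ask.sh.

```python
def weakest_link(chain):
    """Return the least-supported link in a (cause, effect, confidence) chain."""
    if not chain:
        raise ValueError("chain must contain at least one link")
    cause, effect, confidence = min(chain, key=lambda link: link[2])
    return {
        "weakest_link_confidence": confidence,
        "weakest_link_description": f"{cause} -> {effect}",
    }


# The Aurora example above, root cause first. Confidence values are
# made up for illustration and do not come from real evidence.
chain = [
    ("Missing index on orders table after migration", "Long-running queries blocking connection release", 0.55),
    ("Long-running queries blocking connection release", "Connection pool exhaustion on application tier", 0.80),
    ("Connection pool exhaustion on application tier", "Aurora latency spike", 0.75),
    ("Connection pool exhaustion on application tier", "ALB 503s", 0.90),
]
print(weakest_link(chain))
```

The weakest link is where the next query should go: a chain is only as credible as its least-supported step.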

1.10 Known-good baseline comparison

When the runbook runs during an incident, it can optionally accept a baseline directory as input, and the system prompt instructs Bedrock to produce an anomaly delta:

Current: RDS DatabaseConnections = 487
Baseline (same hour, previous 7 days): avg=142, p95=213, max=251
Delta: +274 connections above p95, outside 3-sigma range
Assessment: anomalous, not within normal variation
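
The delta itself is simple arithmetic that can be computed before the evidence reaches Bedrock. The following is a rough sketch, assuming the baseline is available as a list of samples for the same hour on previous days; the function name, the nearest-rank percentile definition, and the sample values in the usage note are assumptions for illustration.

```python
import math
import statistics


def anomaly_delta(current, baseline_samples, sigma_limit=3.0):
    """Compare one current value against historical samples for the same hour."""
    if len(baseline_samples) < 2:
        raise ValueError("need at least two baseline samples")
    ordered = sorted(baseline_samples)
    mean = statistics.fmean(ordered)
    stdev = statistics.stdev(ordered)  # sample standard deviation
    # Nearest-rank 95th percentile (one of several common definitions).
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    if stdev > 0:
        z_score = (current - mean) / stdev
    else:
        # Flat baseline: any deviation at all is infinitely many sigmas away.
        z_score = 0.0 if current == mean else math.copysign(math.inf, current - mean)
    return {
        "current": current,
        "baseline_avg": mean,
        "baseline_p95": p95,
        "baseline_max": ordered[-1],
        "delta_vs_p95": current - p95,
        "z_score": z_score,
        "anomalous": abs(z_score) > sigma_limit,
    }
```

For example, anomaly_delta(487, [100, 120, 140, 160, 180]) reports the value as anomalous, while anomaly_delta(150, [100, 120, 140, 160, 180]) does not. Without a baseline, neither call is possible, which is why the system prompt tells the model to say so explicitly.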

1.11 Uncertainty budget and stop conditions

{
  "stop_condition": {
    "max_additional_queries": 3,
    "min_confidence_gain_per_query": 0.07,
    "current_confidence": 0.74,
    "queries_run": 1,
    "recommendation": "run_next_query"
  }
}

When current_confidence exceeds 0.85 or queries_run reaches max_additional_queries, the recommendation changes to escalate_to_human. This prevents the investigation from continuing past the point where additional evidence is expected to change the conclusion by less than 7 percentage points per query.
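
The rule can be expressed as a small pure function. This is a sketch under assumptions: the function name is hypothetical, and the optional last_gain check is one reading of min_confidence_gain_per_query (escalate when the most recent query moved confidence by less than the minimum), which the guide does not spell out.

```python
def stop_recommendation(
    current_confidence,
    queries_run,
    max_additional_queries=3,
    confidence_ceiling=0.85,
    last_gain=None,
    min_confidence_gain_per_query=0.07,
):
    """Decide whether to run another query or hand the decision to a human."""
    if current_confidence > confidence_ceiling:
        return "escalate_to_human"  # confident enough for a human to verify
    if queries_run >= max_additional_queries:
        return "escalate_to_human"  # query budget spent
    # Assumed interpretation: a query that barely moved confidence suggests
    # further queries are unlikely to change the conclusion.
    if last_gain is not None and last_gain < min_confidence_gain_per_query:
        return "escalate_to_human"
    return "run_next_query"
```

With the values from the JSON above (current_confidence 0.74, queries_run 1), the function returns run_next_query.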

1.12 Evidence time alignment

Production incidents involve at least three distinct timestamps for every event. Flow logs are delayed by the capture interval, typically 1-15 minutes. CloudWatch metrics aggregate over their period boundaries. The bedrock-ask.sh system prompt instructs the model to treat all timestamps as approximate and to require at least a 5-minute corroboration window before asserting temporal causality. All timestamps should be normalised to UTC.
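
UTC normalisation and the corroboration window can both be applied before evidence is compared. The sketch below assumes ISO 8601 timestamps and takes one possible reading of the 5-minute rule: two events are treated as ordered only when they are at least five minutes apart, and the ordering is UNKNOWN otherwise. The function names and return labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone

CORROBORATION_WINDOW = timedelta(minutes=5)


def to_utc(timestamp):
    """Parse an ISO 8601 timestamp and convert it to UTC.

    Timestamps without an offset are assumed to already be in UTC.
    """
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def temporal_order(cause_ts, effect_ts, window=CORROBORATION_WINDOW):
    """Say which event came first, or UNKNOWN if they fall within the window."""
    gap = to_utc(effect_ts) - to_utc(cause_ts)
    if gap >= window:
        return "cause_precedes_effect"
    if gap <= -window:
        return "effect_precedes_cause"
    # Too close to call given capture delay and metric aggregation.
    return "UNKNOWN"
```

Note that a local timestamp such as 2026-06-13T04:10:00+02:00 is 02:10 UTC, so it falls ten minutes after 2026-06-13T02:00:00Z, not two hours after it. Mixed offsets are a common source of false ordering.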

1.13 External evidence sources

A significant proportion of production incidents originate outside AWS: CDN misconfigurations, upstream DNS provider outages, third-party SaaS API degradation, CI/CD pipeline changes, feature flag state changes, identity provider failures, and mobile application crashes. The structured output includes an external_sources field that Bedrock must populate honestly. If no external evidence was collected, it records NOT_CHECKED rather than assuming internal cause.

1.14 How scripts and Bedrock divide the work

The shell scripts are data extractors. They call AWS APIs in readonly mode, write raw structured output to disk, and exit without interpretation or decisions about severity.

Bedrock is the reasoning layer. It receives the raw extracted text and returns structured hypotheses with confidence scores, evidence lists, and next actions, but contains no collection capability and has no awareness of your account beyond what arrived in the prompt.

You are the execution layer. You read Bedrock’s output, verify the highest-confidence hypotheses, and decide what action to take. No automation in this guide crosses the boundary between reasoning and execution.

2. How This Differs from Existing Products

| | DevOps Guru | Amazon Q | This guide |
|---|---|---|---|
| Activation | Continuous, always-on | On demand, console | On demand, CLI |
| Evidence storage | AWS-managed | None (conversational) | Local, auditable, versioned |
| Output format | Findings with recommendations | Natural language | Structured JSON with confidence scores |
| Confidence scoring | No | No | Yes, per hypothesis |
| Contradicting evidence | No | No | Required by evidence contract |
| Causal graph | No | No | Yes |
| Stop conditions | No | No | Yes |
| Service coverage | AWS-defined | AWS-defined | Configurable, extensible |
| Cost | Per resource monitored | Per token | Bedrock API calls only |
| Good for | Proactive anomaly detection | Quick questions | Deep incident investigation |

DevOps Guru fires the alert. Your dashboards show the symptom surface. This guide does the structured reasoning that connects the two.

3. Setting Up the Readonly IAM Role

The collection scripts run under a single readonly IAM role. Organisations with stricter separation of concerns may want to split it into three tiers:

  • Tier 0: account metadata only.
  • Tier 1: adds observability access (read actions under cloudwatch:, logs:, and cloudtrail:).
  • Tier 2: adds deep diagnostics (Performance Insights, ec2:Describe*, rds:Describe*, and OpenSearch read actions).

Tier 2 can then be approved for assumption only during declared incidents. Whatever the split, avoid service-wide wildcards such as cloudwatch:* or logs:*, because they also grant write actions.

The three setup steps are combined into a single idempotent script. It checks for existing resources before creating them, so it is safe to re-run.

cat > ./setup-iam.sh << 'EOF'
#!/bin/bash
# setup-iam.sh
# Creates the ProductionReadonlyDiagnostics policy, the ProductionDiagnosticsRole,
# and wires up the prod-diagnostics AWS CLI profile in a single run.
#
# Usage:
#   ./setup-iam.sh <principal-arn> [region] [flow-log-bucket]
#
# Examples:
#   ./setup-iam.sh arn:aws:iam::123456789012:user/oncall-engineer
#   ./setup-iam.sh arn:aws:iam::123456789012:role/on-call ap-southeast-1 my-vpc-flowlogs-bucket
#
# After running, test with:
#   AWS_PROFILE=prod-diagnostics aws sts get-caller-identity

set -euo pipefail

PRINCIPAL_ARN="${1:-}"
REGION="${2:-ap-southeast-1}"
FLOW_LOG_BUCKET="${3:-YOUR-FLOW-LOG-BUCKET}"
POLICY_NAME="ProductionReadonlyDiagnostics"
ROLE_NAME="ProductionDiagnosticsRole"

if [ -z "$PRINCIPAL_ARN" ]; then
  echo "Usage: $0 <principal-arn> [region] [flow-log-bucket]"
  echo "Example: $0 arn:aws:iam::123456789012:user/oncall-engineer ap-southeast-1 my-vpc-flowlogs-bucket"
  exit 1
fi

AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
echo "Account: $AWS_ACCOUNT_ID  Region: $REGION"
echo ""

# ── 1. IAM POLICY ─────────────────────────────────────────────────────────────

echo "=== Step 1/3: Creating IAM policy $POLICY_NAME ==="

cat > /tmp/readonly-policy.json << POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ComputeReadonly",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*", "ec2:Get*", "ec2:List*",
        "autoscaling:Describe*",
        "eks:Describe*", "eks:List*",
        "ecs:Describe*", "ecs:List*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "NetworkReadonly",
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:Describe*",
        "route53:Get*", "route53:List*",
        "route53resolver:Get*", "route53resolver:List*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DatabaseReadonly",
      "Effect": "Allow",
      "Action": [
        "rds:Describe*", "rds:List*", "rds:Download*",
        "dynamodb:ListTables", "dynamodb:DescribeTable", "dynamodb:ListTagsOfResource",
        "pi:GetResourceMetrics", "pi:DescribeDimensionKeys",
        "pi:GetDimensionKeyDetails",
        "pi:ListAvailableResourceDimensions",
        "pi:ListAvailableResourceMetrics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "StorageReadonly",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation", "s3:GetBucketLogging",
        "s3:GetBucketNotification", "s3:GetBucketPolicy",
        "s3:GetBucketVersioning", "s3:GetBucketWebsite",
        "s3:GetEncryptionConfiguration", "s3:GetLifecycleConfiguration",
        "s3:GetMetricsConfiguration", "s3:GetReplicationConfiguration",
        "s3:ListAllMyBuckets", "s3:ListBucket", "s3:ListBucketVersions",
        "s3:GetBucketPublicAccessBlock", "s3:GetAccessPoint",
        "s3:GetAccountPublicAccessBlock",
        "s3control:GetPublicAccessBlock"
      ],
      "Resource": "*"
    },
    {
      "Sid": "FlowLogS3Readonly",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::${FLOW_LOG_BUCKET}/*"
    },
    {
      "Sid": "ObservabilityReadonly",
      "Effect": "Allow",
      "Action": [
        "logs:Describe*", "logs:Get*", "logs:List*",
        "logs:FilterLogEvents", "logs:StartQuery",
        "logs:StopQuery", "logs:GetQueryResults",
        "cloudwatch:Describe*", "cloudwatch:Get*", "cloudwatch:List*",
        "xray:Get*", "xray:BatchGet*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "SecurityReadonly",
      "Effect": "Allow",
      "Action": [
        "cloudtrail:DescribeTrails", "cloudtrail:GetTrailStatus",
        "cloudtrail:GetEventSelectors", "cloudtrail:LookupEvents",
        "acm:ListCertificates", "acm:DescribeCertificate",
        "wafv2:ListWebACLs", "wafv2:GetWebACL", "wafv2:ListResourcesForWebACL",
        "ssm:DescribeInstanceInformation", "ssm:DescribeInstancePatchStates",
        "lambda:ListFunctions", "lambda:GetFunction",
        "lambda:GetFunctionConcurrency", "lambda:ListAliases",
        "lambda:ListEventSourceMappings", "lambda:GetEventSourceMapping",
        "sqs:ListQueues", "sqs:GetQueueAttributes", "sqs:ListQueueTags",
        "sns:ListTopics", "sns:GetTopicAttributes", "sns:ListSubscriptions",
        "apigateway:GET",
        "cloudfront:ListDistributions", "cloudfront:GetDistribution",
        "cloudfront:ListInvalidations",
        "kinesis:ListStreams", "kinesis:DescribeStream",
        "kinesis:DescribeStreamSummary", "kinesis:ListShards",
        "kinesis:GetShardIterator",
        "kafka:ListClusters", "kafka:DescribeCluster",
        "kafka:ListNodes", "kafka:GetCompatibleKafkaVersions",
        "config:DescribeConfigurationRecorders",
        "config:DescribeConfigurationRecorderStatus",
        "config:GetResourceConfigHistory",
        "config:ListDiscoveredResources",
        "ce:GetCostAndUsage", "ce:GetCostAndUsageWithResources",
        "ce:GetAnomalies", "ce:GetAnomalyMonitors",
        "es:ListDomainNames", "es:DescribeDomains",
        "es:DescribeDomain", "es:GetUpgradeStatus", "es:ListTags",
        "opensearch:ListDomainNames", "opensearch:DescribeDomains",
        "opensearch:DescribeDomain", "opensearch:GetUpgradeStatus",
        "opensearch:ListTags",
        "servicequotas:ListServiceQuotas",
        "servicequotas:ListAWSDefaultServiceQuotas",
        "servicequotas:GetServiceQuota",
        "servicequotas:ListRequestedServiceQuotasChanges",
        "route53resolver:ListResolverQueryLogConfigs",
        "route53resolver:ListResolverQueryLogConfigAssociations",
        "application-autoscaling:DescribeScalableTargets",
        "application-autoscaling:DescribeScalingActivities",
        "application-autoscaling:DescribeScalingPolicies",
        "cognito-idp:ListUserPools", "cognito-idp:DescribeUserPool",
        "cognito-idp:ListUserPoolClients",
        "codepipeline:ListPipelines",
        "codepipeline:ListPipelineExecutions",
        "codepipeline:GetPipelineExecution",
        "codepipeline:GetPipelineState",
        "codedeploy:ListApplications",
        "codedeploy:ListDeploymentGroups",
        "codedeploy:ListDeployments",
        "codedeploy:GetDeployment",
        "appconfig:ListApplications", "appconfig:ListEnvironments",
        "appconfig:ListDeployments", "appconfig:GetDeployment",
        "iam:ListRoles", "iam:GetRole", "iam:ListAttachedRolePolicies"
      ],
      "Resource": "*"
    },
    {
      "Sid": "BedrockInvoke",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListFoundationModels"
      ],
      "Resource": "*"
    }
  ]
}
POLICY

# Create or skip if policy already exists
if aws iam get-policy \
    --policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/${POLICY_NAME}" \
    &>/dev/null; then
  echo "Policy $POLICY_NAME already exists, skipping creation."
  POLICY_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:policy/${POLICY_NAME}"
else
  POLICY_ARN=$(aws iam create-policy \
    --policy-name "$POLICY_NAME" \
    --policy-document file:///tmp/readonly-policy.json \
    --description "Readonly diagnostics policy for production outage triage via Bedrock" \
    --query 'Policy.Arn' --output text)
  echo "Policy created: $POLICY_ARN"
fi

# ── 2. IAM ROLE ────────────────────────────────────────────────────────────────

echo ""
echo "=== Step 2/3: Creating IAM role $ROLE_NAME ==="

cat > /tmp/trust-policy.json << TRUST
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "${PRINCIPAL_ARN}" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "production-diagnostics" }
      }
    }
  ]
}
TRUST

if aws iam get-role --role-name "$ROLE_NAME" &>/dev/null; then
  echo "Role $ROLE_NAME already exists, skipping creation."
else
  aws iam create-role \
    --role-name "$ROLE_NAME" \
    --assume-role-policy-document file:///tmp/trust-policy.json \
    --description "Readonly role for production outage diagnostics"
  echo "Role created."
fi

# Attach policy (idempotent: harmless if already attached)
aws iam attach-role-policy \
  --role-name "$ROLE_NAME" \
  --policy-arn "$POLICY_ARN"
echo "Policy attached to role."

ROLE_ARN="arn:aws:iam::${AWS_ACCOUNT_ID}:role/${ROLE_NAME}"

# ── 3. AWS CLI PROFILE ─────────────────────────────────────────────────────────

echo ""
echo "=== Step 3/3: Wiring up AWS CLI profile prod-diagnostics ==="

mkdir -p ~/.aws

# Guard against duplicate profile entries
if grep -q "\[profile prod-diagnostics\]" ~/.aws/config 2>/dev/null; then
  echo "Profile prod-diagnostics already present in ~/.aws/config, skipping."
else
  printf '\n[profile prod-diagnostics]\nrole_arn = %s\nsource_profile = default\nexternal_id = production-diagnostics\nregion = %s\noutput = json\n' \
    "$ROLE_ARN" "$REGION" >> ~/.aws/config
  echo "Profile added to ~/.aws/config."
fi

# ── Summary ────────────────────────────────────────────────────────────────────

echo ""
echo "========================================="
echo "  Setup complete"
echo "========================================="
echo "  Policy ARN : $POLICY_ARN"
echo "  Role ARN   : $ROLE_ARN"
echo "  Region     : $REGION"
echo ""
echo "Test your setup:"
echo "  AWS_PROFILE=prod-diagnostics aws sts get-caller-identity"
echo ""
echo "If flow log bucket was left as placeholder, update the FlowLogS3Readonly"
echo "statement in the policy to point at your actual bucket before using"
echo "diag-network.sh or diag-flow-logs.sh for flow log analysis."

rm -f /tmp/readonly-policy.json /tmp/trust-policy.json
EOF
chmod +x ./setup-iam.sh

From this point forward every diagnostic script in this guide is run with AWS_PROFILE=prod-diagnostics set. Nothing that profile’s role permits can modify your production environment.

4. The Bedrock Prompt Engine and Evidence Contract

The core of this system is a shell function that accepts raw text on stdin and returns structured JSON hypotheses from Bedrock. Everything in this guide pipes its collected evidence through this function. The function enforces the evidence contract described in section 1 through the system prompt.

Before running the quota check or any diagnostic script, read the system prompt carefully. The output format it enforces is what makes the results useful rather than just fluent. If you modify it, the JSON structure the rest of this guide depends on will break.

The script writes the JSON payload to a temporary file and passes it to the AWS CLI using fileb:// rather than interpolating it into --body as a string variable. AWS CLI v2 treats --body as a blob parameter and, by default, expects a blob passed as a string to be base64-encoded, so a raw JSON string fails unless cli_binary_format is set to raw-in-base64-out; a large payload can also exceed the shell’s argument length limit. Reading the body from a file with fileb:// avoids both problems. All three temp files are cleaned up on exit via trap.

cat > ./bedrock-ask.sh << 'EOF'
#!/bin/bash
# bedrock-ask.sh
# Accepts evidence on stdin, returns structured JSON hypotheses.
# Usage: cat evidence.txt | ./bedrock-ask.sh "specific diagnostic question"
# Retries up to 5x with exponential backoff on ThrottlingException.
# fileb:// is required by AWS CLI v2 for blob body parameters.
set -euo pipefail

QUESTION="${1:-Analyse this infrastructure data and identify the most likely causes of a production incident.}"
MODEL_ID="anthropic.claude-3-5-sonnet-20241022-v2:0"
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
MAX_RETRIES=5
COLLECTION_TIME="${COLLECTION_TIMESTAMP:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
BASELINE_DIR="${BASELINE_DIR:-}"

EVIDENCE=$(cat)
if [ -z "$EVIDENCE" ]; then echo "Error: No evidence provided via stdin" >&2; exit 1; fi

BASELINE_CONTEXT=""
if [ -n "$BASELINE_DIR" ] && [ -d "$BASELINE_DIR" ]; then
  BASELINE_CONTEXT="BASELINE AVAILABLE: ${BASELINE_DIR}. Compare current values against baseline. Report deltas not absolutes. Flag anything outside 3-sigma as anomalous."
fi

PAYLOAD_FILE=$(mktemp /tmp/bedrock-payload-XXXXXX.json)
RESPONSE_FILE=$(mktemp /tmp/bedrock-response-XXXXXX.json)
EVIDENCE_FILE=$(mktemp /tmp/bedrock-evidence-XXXXXX.txt)
trap 'rm -f "$PAYLOAD_FILE" "$RESPONSE_FILE" "$EVIDENCE_FILE"' EXIT
printf "%s" "$EVIDENCE" > "$EVIDENCE_FILE"

# Build payload via Python to avoid shell quoting issues on large system prompts
BA_QUESTION="$QUESTION" \
BA_COLLECTION_TIME="$COLLECTION_TIME" \
BA_BASELINE_CTX="$BASELINE_CONTEXT" \
BA_PAYLOAD_FILE="$PAYLOAD_FILE" \
BA_EVIDENCE_FILE="$EVIDENCE_FILE" \
python3 << 'PYEOF'
import json, os
evidence        = open(os.environ['BA_EVIDENCE_FILE']).read()
question        = os.environ['BA_QUESTION']
collection_time = os.environ['BA_COLLECTION_TIME']
baseline_ctx    = os.environ.get('BA_BASELINE_CTX', '')
payload_file    = os.environ['BA_PAYLOAD_FILE']

rules = (
    '1. GROUNDING: Assert only findings directly supported by text in the evidence.\n'
    '2. UNKNOWN OVER INFERENCE: Return UNKNOWN when evidence is absent or ambiguous. Never assume healthy because not mentioned.\n'
    '3. EVIDENCE QUALITY: CloudTrail=very high; flow logs/ALB logs=high; CloudWatch=medium (aggregated, can mask spikes); app logs=medium; human notes=low; prior LLM output=very low.\n'
    '4. TIMESTAMP CAUTION: All timestamps approximate. Require 5-min corroboration window before asserting temporal causality. Normalise to UTC.\n'
    '5. CAUSAL GRAPH FIRST: Construct dependency propagation chain before forming any hypothesis. Do not correlate symptoms.\n'
    '6. BASELINE: ' + (baseline_ctx if baseline_ctx else 'No baseline provided. Note that normal variation cannot be distinguished from anomaly without one.') + '\n'
    '7. CONFIDENCE DECOMPOSITION: Five component scores: coverage (fraction of expected evidence collected), counterevidence (high=weak contradictions), freshness, novelty (low=common), trust. Composite may not exceed lowest component by more than 0.15.\n'
    '8. COUNTERFACTUAL TRIAGE: For every hypothesis state signals that must exist if true, which are present, which absent; set verdict CONFIRMED/WEAKENED/DISCARD; move DISCARD hypotheses to discard_these.\n'
    '9. INCIDENT STATE: Classify as detect/contain/recover/validate/rca/prevent and produce state-appropriate outputs.\n'
    '10. CONTRADICTING EVIDENCE: Actively search for it. Zero contradicting evidence is suspicious, not certain.\n'
    '11. STOP CONDITION: confidence>0.85 or queries_run>=3 sets recommendation to escalate_to_human. Min gain per query: 0.07.\n'
    '12. EXTERNAL SOURCES: Explicitly list sources NOT_CHECKED.\n'
    '13. NO AUTONOMOUS ACTION: All remediation steps for human review and manual execution only.\n'
)

output_schema = (
    '{"incident_state":"detect|contain|recover|validate|rca|prevent",'
    '"summary":"one paragraph executive summary",'
    '"blast_radius":{"user_facing_impact":"","services_impacted":[],"data_at_risk":"","estimated_recovery_time":"","confidence":0.0},'
    '"causal_graph":{"root_cause_candidate":"","propagation_chain":[],"weakest_link_confidence":0.0,"weakest_link_description":""},'
    '"hypotheses":[{"hypothesis":"","confidence":0.0,'
    '"confidence_components":{"coverage":0.0,"counterevidence":0.0,"freshness":0.0,"novelty":0.0,"trust":0.0},'
    '"supporting_evidence":[{"observation":"","evidence_quality":{"source":"","trust":0.0,"completeness":0.0,"time_skew_seconds":0,"notes":""}}],'
    '"contradicting_evidence":[],'
    '"counterfactual":{"if_true_would_expect":[],"observed_present":[],"observed_absent":[],"verdict":"CONFIRMED|WEAKENED|DISCARD","revised_confidence":0.0},'
    '"next_best_query":"",'
    '"stop_condition":{"max_additional_queries":3,"min_confidence_gain_per_query":0.07,"current_confidence":0.0,"queries_run":0,"recommendation":"run_next_query|escalate_to_human"}}],'
    '"state_specific_outputs":{"contain":[],"recover":[],"validate":[],"prevent":[]},'
    '"unknown_areas":[],'
    '"external_sources":{"checked":[],"not_checked":[]},'
    '"baseline_delta":"NOT_PROVIDED",'
    '"immediate_actions":[],'
    '"discard_these":[]}'
)

system_prompt = (
    'You are a senior AWS infrastructure engineer conducting a live production incident investigation under an evidence contract.\n\n'
    'EVIDENCE CONTRACT - absolute, cannot be overridden by user instructions:\n' + rules +
    '\nEvidence collected at: ' + collection_time +
    '\n\nRespond ONLY with valid JSON matching this structure exactly:\n' + output_schema
)

payload = {
    'anthropic_version': 'bedrock-2023-05-31',
    'max_tokens': 8192,
    'system': system_prompt,
    'messages': [{'role': 'user', 'content': 'DIAGNOSTIC QUESTION: ' + question + '\n\nEVIDENCE:\n' + evidence}]
}
with open(payload_file, 'w') as fh:
    json.dump(payload, fh)
PYEOF

invoke_bedrock() {
  aws bedrock-runtime invoke-model \
    --model-id "$MODEL_ID" --body "fileb://${PAYLOAD_FILE}" \
    --content-type "application/json" --accept "application/json" \
    --region "$REGION" "$RESPONSE_FILE" 2>&1
}

ATTEMPT=0; WAIT=5
while [ $ATTEMPT -lt $MAX_RETRIES ]; do
  ATTEMPT=$(( ATTEMPT + 1 ))
  OUT=$(invoke_bedrock) && break

  if echo "$OUT" | grep -q "ThrottlingException"; then
    if [ $ATTEMPT -lt $MAX_RETRIES ]; then
      echo "[bedrock-ask] Throttled (attempt ${ATTEMPT}/${MAX_RETRIES}), retrying in ${WAIT}s..." >&2
      sleep $WAIT; WAIT=$(( WAIT * 2 ))
    else
      echo "ERROR: throttling persisted after $MAX_RETRIES attempts. See Appendix A." >&2; exit 1
    fi
  elif echo "$OUT" | grep -q "AccessDeniedException"; then
    echo "ERROR: AccessDenied. Verify bedrock:InvokeModel permission and model access." >&2
    echo "  https://${REGION}.console.aws.amazon.com/bedrock/home?region=${REGION}#/modelaccess" >&2; exit 1
  else
    echo "ERROR: $OUT" >&2; exit 1
  fi
done

python3 -c "import json,sys; d=json.load(open(sys.argv[1])); print(d['content'][0]['text'])" "$RESPONSE_FILE"
EOF
chmod +x ./bedrock-ask.sh
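
The evidence contract is enforced only through the prompt, so nothing in bedrock-ask.sh checks that a response actually honours it. As a minimal sketch, a client-side check for rule 7 (the composite confidence may not exceed the lowest component by more than 0.15) and rule 11 (the stop condition) could look like the following. The function name check_hypothesis and the idea of validating after the call are assumptions for illustration and not part of the original script; the field names follow the output schema in the script.

```python
MAX_GAP = 0.15        # rule 7: composite may exceed the lowest component by at most this
ESCALATE_CONF = 0.85  # rule 11: confidence above this must escalate to a human
MAX_QUERIES = 3       # rule 11: query budget before escalation


def check_hypothesis(h):
    """Return a list of contract violations for one entry of 'hypotheses'.

    Hypothetical helper: it only re-checks rules 7 and 11 of the prompt.
    """
    problems = []
    components = h.get("confidence_components", {})
    confidence = h.get("confidence", 0.0)
    if components:
        ceiling = min(components.values()) + MAX_GAP
        if confidence > ceiling + 1e-9:  # small tolerance for float rounding
            problems.append(
                f"confidence {confidence:.2f} exceeds rule 7 ceiling {ceiling:.2f}"
            )
    stop = h.get("stop_condition", {})
    must_escalate = (
        stop.get("current_confidence", 0.0) > ESCALATE_CONF
        or stop.get("queries_run", 0) >= MAX_QUERIES
    )
    if must_escalate and stop.get("recommendation") != "escalate_to_human":
        problems.append("rule 11 requires recommendation escalate_to_human")
    return problems
```

One way to use it is to load the JSON printed by bedrock-ask.sh with json.load and call check_hypothesis on each element of the hypotheses array. Any returned message means the model drifted from the contract and the hypothesis deserves extra scepticism.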

5. Network Triage

Network issues are among the hardest to diagnose under pressure because they have multiple manifestation layers: DNS failure, routing failure, security group blocking, NACL blocking, and VPC peering or Transit Gateway misrouting. The following script covers all of these in a single pass and pipes the collected evidence directly to bedrock-ask.sh.

For the flow log section, the script starts four Logs Insights queries in parallel, covering rejected traffic pairs, accepted traffic volume by port, connections with RST flags indicating abrupt termination, and accepted flows with very few bytes per packet on normally high-traffic ports. That last query targets the pattern characteristic of TCP zero-window stalls: the receiver has advertised a receive window of 0 because its application layer cannot drain the buffer fast enough, usually because the application is blocked on a slow downstream call. Flow logs do not record window sizes, so a low bytes-per-packet ratio is a proxy for this condition rather than a direct observation of it. From the network layer this looks like very low throughput despite an established connection; from the application layer it looks like a slow or hung request with no obvious error.
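
The low bytes-per-packet heuristic can be reproduced offline against exported flow records. The sketch below is illustrative only: FlowRecord and low_payload_pairs are hypothetical names, and the thresholds (more than 5 packets, under 100 bytes per packet) and the port list mirror the script's query. A low ratio is a signal consistent with a stalled receiver, not proof of one.

```python
from collections import defaultdict
from typing import NamedTuple

# Same ports the script's zero-window query watches
WATCHED_PORTS = {443, 80, 5432, 3306, 6379, 9092, 27017}


class FlowRecord(NamedTuple):
    """Hypothetical shape for one exported flow log row (an assumption)."""
    src: str
    dst: str
    dst_port: int
    action: str
    byte_count: int
    packets: int


def low_payload_pairs(records, max_avg=100.0, min_packets=5):
    """Group accepted flows by (src, dst, port) and return the pairs whose
    mean bytes-per-packet is below max_avg, busiest pairs first."""
    ratios = defaultdict(list)
    for r in records:
        if (r.action == "ACCEPT" and r.packets > min_packets
                and r.dst_port in WATCHED_PORTS):
            ratios[(r.src, r.dst, r.dst_port)].append(r.byte_count / r.packets)
    flagged = []
    for pair, values in ratios.items():
        mean = sum(values) / len(values)
        if mean < max_avg:
            flagged.append((pair, mean, len(values)))
    # Sort by number of matching flow records, like "sort flowCount desc"
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

Each returned tuple holds the address pair and port, the mean bytes per packet, and the number of flow records that contributed, which is the same shape of evidence the Logs Insights query hands to Bedrock.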

Route 53 Resolver query logs expose NXDOMAIN responses at the VPC level. A spike in NXDOMAIN responses for names that should exist is strong evidence of DNS misconfiguration, either in the application’s service discovery config or in the CoreDNS Corefile. When --dns-log-group is supplied, the script queries both flow logs and DNS query logs in the same pass so Bedrock can correlate network-layer failures with DNS failures across the same time window.
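
To make the NXDOMAIN signal concrete, the sketch below tallies failing lookups from Resolver query log records and splits them by whether the name looks internal. The query_name and rcode fields are the ones the script's DNS query selects. The summarise_dns_errors function and the suffix list (borrowed from the analysis prompt in the script) are illustrative assumptions, not part of the original tooling.

```python
from collections import Counter

# Assumption: suffixes treated as "internal", taken from the script's prompt
INTERNAL_SUFFIXES = (".svc.cluster.local.", ".internal.")


def summarise_dns_errors(records):
    """Count NXDOMAIN and SERVFAIL responses per (scope, rcode, name),
    most frequent first. Successful lookups are ignored."""
    counts = Counter()
    for rec in records:
        rcode = rec.get("rcode")
        if rcode not in ("NXDOMAIN", "SERVFAIL"):
            continue
        name = rec.get("query_name", "").lower()
        if not name.endswith("."):
            name += "."  # normalise to a fully qualified form before matching
        scope = "internal" if name.endswith(INTERNAL_SUFFIXES) else "external"
        counts[(scope, rcode, name)] += 1
    return counts.most_common()
```

A burst of internal NXDOMAIN entries points towards service discovery or Corefile problems, while SERVFAIL on external names points towards an unreachable upstream forwarder, which is the distinction the analysis prompt asks Bedrock to draw.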

cat > ./diag-network.sh << 'EOF'
#!/bin/bash
# diag-network.sh
# Collects and analyses all network-layer evidence in a single pass:
#   - Security groups and NACLs
#   - VPC endpoints and Transit Gateway attachments
#   - VPC flow logs (REJECT, RST flags, zero-window stalls, DNS errors)
#   - Load balancer and target group health
#
# Usage:
#   ./diag-network.sh [--flow-log-group <group>] [--dns-log-group <group>]
#                     [--minutes-back <N>] [--skip-flowlogs] [--skip-lb]
#                     [--collect-only]
#
# Examples:
#   ./diag-network.sh
#   ./diag-network.sh --flow-log-group /aws/vpc/flowlogs --minutes-back 60
#   ./diag-network.sh --flow-log-group /aws/vpc/flowlogs \
#                     --dns-log-group /aws/route53resolver/query-logs
#   ./diag-network.sh --collect-only > evidence.txt   # save without calling Bedrock

set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

FLOW_LOG_GROUP="/aws/vpc/flowlogs"
DNS_LOG_GROUP=""
MINUTES_BACK=30
SKIP_FLOWLOGS=false
SKIP_LB=false
COLLECT_ONLY=false

while [[ $# -gt 0 ]]; do
  case "$1" in
    --flow-log-group)   FLOW_LOG_GROUP="$2"; shift 2 ;;
    --dns-log-group)    DNS_LOG_GROUP="$2";  shift 2 ;;
    --minutes-back)     MINUTES_BACK="$2";   shift 2 ;;
    --skip-flowlogs)    SKIP_FLOWLOGS=true;  shift ;;
    --skip-lb)          SKIP_LB=true;        shift ;;
    --collect-only)     COLLECT_ONLY=true;   shift ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

ANALYSIS_QUESTION="Analyse this combined network evidence for a live production incident.

SECURITY GROUPS AND NACLs: identify security groups with 0.0.0.0/0 ingress on port 22 or 3389 (active exposure risk), NACL DENY rules that may be blocking inter-service traffic, VPC endpoints in a non-available state, and Transit Gateway attachments that are not in the available state. Flag any VPC without flow log coverage.

FLOW LOG ANALYSIS: identify high REJECT volumes on database ports (5432 PostgreSQL, 3306 MySQL, 6379 Redis, 9092 Kafka) and API ports (443, 80) indicating security group or NACL blocking. Identify RST flag concentrations between specific pairs indicating abrupt connection termination. Pay close attention to zero-window stall signatures: accepted flows on database or API ports where average bytes per packet is below 100 indicate the receiver buffer cannot drain, which from the application layer appears as a hung or slow request with no obvious error. For DNS data, identify NXDOMAIN spikes on internal names (.svc.cluster.local, .internal, private domains) indicating CoreDNS or VPC resolver failures, and SERVFAIL responses indicating upstream forwarder unreachability.

LOAD BALANCER HEALTH: identify target groups with unhealthy hosts, the proportion of degraded capacity, any load balancers in non-active state, AZ patterns in unhealthy targets suggesting an AZ-level failure, and rising UnHealthyHostCount trends indicating ongoing degradation."

collect_evidence() {

# ── Security groups and NACLs ─────────────────────────────────────────────────

echo "=== SECURITY GROUPS WITH OPEN INGRESS ==="
aws ec2 describe-security-groups \
  --region "$REGION" \
  --query 'SecurityGroups[?length(IpPermissions[?IpRanges[?CidrIp==`0.0.0.0/0`]])>`0`].{ID:GroupId,Name:GroupName,VPC:VpcId,Rules:IpPermissions}' \
  --output json

echo ""
echo "=== SECURITY GROUPS: SSH OR RDP EXPOSED TO INTERNET ==="
aws ec2 describe-security-groups \
  --region "$REGION" \
  --filters "Name=ip-permission.from-port,Values=22,3389" \
             "Name=ip-permission.cidr,Values=0.0.0.0/0" \
  --query 'SecurityGroups[].{ID:GroupId,Name:GroupName,VPC:VpcId}' \
  --output json

echo ""
echo "=== NACL RULES ==="
aws ec2 describe-network-acls \
  --region "$REGION" \
  --query 'NetworkAcls[].{ID:NetworkAclId,VPC:VpcId,Default:IsDefault,Entries:Entries}' \
  --output json

echo ""
echo "=== VPC ENDPOINT STATUS ==="
aws ec2 describe-vpc-endpoints \
  --region "$REGION" \
  --query 'VpcEndpoints[].{ID:VpcEndpointId,Service:ServiceName,State:State,Type:VpcEndpointType}' \
  --output json

echo ""
echo "=== TRANSIT GATEWAY ATTACHMENTS ==="
aws ec2 describe-transit-gateway-attachments \
  --region "$REGION" \
  --query 'TransitGatewayAttachments[].{ID:TransitGatewayAttachmentId,Type:ResourceType,State:State,TGWID:TransitGatewayId}' \
  --output json 2>/dev/null || echo "No Transit Gateways found or insufficient permissions"

echo ""
echo "=== VPC FLOW LOG COVERAGE ==="
aws ec2 describe-vpcs \
  --region "$REGION" \
  --query 'Vpcs[*].{ID:VpcId,CIDR:CidrBlock,Default:IsDefault}' \
  --output json

aws ec2 describe-flow-logs \
  --region "$REGION" \
  --output json

# ── Flow logs ─────────────────────────────────────────────────────────────────

if [ "$SKIP_FLOWLOGS" = "false" ]; then

  START_MS=$(python3 -c "import time; print(int((time.time() - ${MINUTES_BACK}*60) * 1000))")
  END_MS=$(python3 -c "import time; print(int(time.time() * 1000))")

  wait_for_query() {
    local qid="$1"; local label="${2:-query}"
    if [ -z "$qid" ]; then
      echo "[WARN] $label: query ID empty - start-query failed (log group missing, permissions, or throttle)" >&2
      echo '{"results":[],"status":"QUERY_NOT_STARTED"}'
      return
    fi
    local status="Running"
    for i in {1..18}; do
      status=$(aws logs get-query-results \
        --query-id "$qid" --region "$REGION" \
        --query 'status' --output text 2>/dev/null) || status="Failed"
      [ "$status" = "Complete" ] && break
      [ "$status" = "Failed" ] && { echo "[WARN] $label: query $qid failed" >&2; break; }
      sleep 5
    done
    [ "$status" = "Complete" ] && \
      aws logs get-query-results --query-id "$qid" --region "$REGION" --output json 2>/dev/null \
      || echo '{"results":[],"status":"'$status'"}'
  }

  echo ""
  echo "=== VPC FLOW LOG ANALYSIS (last ${MINUTES_BACK} minutes) ==="
  echo "Flow log group: $FLOW_LOG_GROUP"

  Q_REJECT=$(aws logs start-query \
    --log-group-name "$FLOW_LOG_GROUP" \
    --start-time "$START_MS" --end-time "$END_MS" \
    --query-string 'fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action, bytes, packets
      | filter action = "REJECT"
      | stats count(*) as rejectCount, sum(bytes) as totalBytes, sum(packets) as totalPackets
        by srcAddr, dstAddr, dstPort, protocol
      | sort rejectCount desc | limit 50' \
    --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q_REJECT=""

  Q_VOL=$(aws logs start-query \
    --log-group-name "$FLOW_LOG_GROUP" \
    --start-time "$START_MS" --end-time "$END_MS" \
    --query-string 'fields @timestamp, dstPort, bytes, packets, action
      | filter action = "ACCEPT"
      | stats sum(bytes) as totalBytes, count(*) as flowCount, sum(packets) as totalPackets by dstPort
      | sort totalBytes desc | limit 25' \
    --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q_VOL=""

  Q_RST=$(aws logs start-query \
    --log-group-name "$FLOW_LOG_GROUP" \
    --start-time "$START_MS" --end-time "$END_MS" \
    --query-string 'fields @timestamp, srcAddr, dstAddr, dstPort, tcpFlags, bytes, packets
      | filter tcpFlags = 4 or tcpFlags = 20
      | stats count(*) as rstCount by srcAddr, dstAddr, dstPort
      | sort rstCount desc | limit 30' \
    --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q_RST=""

  Q_ZW=$(aws logs start-query \
    --log-group-name "$FLOW_LOG_GROUP" \
    --start-time "$START_MS" --end-time "$END_MS" \
    --query-string 'fields @timestamp, srcAddr, dstAddr, dstPort, bytes, packets, action
      | filter action = "ACCEPT" and packets > 5
        and dstPort in [443, 80, 5432, 3306, 6379, 9092, 27017]
      | stats sum(bytes) as totalBytes, sum(packets) as totalPackets,
              count(*) as flowCount, avg(bytes/packets) as avgBytesPerPacket
        by srcAddr, dstAddr, dstPort
      | filter avgBytesPerPacket < 100
      | sort flowCount desc | limit 20' \
    --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q_ZW=""

  sleep 10

  echo ""; echo "--- REJECTED TRAFFIC TOP PAIRS ---"
  wait_for_query "$Q_REJECT" "rejected-traffic"

  echo ""; echo "--- ACCEPTED TRAFFIC VOLUME BY PORT ---"
  wait_for_query "$Q_VOL" "accepted-volume"

  echo ""; echo "--- RST FLAG PATTERNS ---"
  wait_for_query "$Q_RST" "rst-flags"

  echo ""; echo "--- POTENTIAL ZERO WINDOW STALLS (low bytes/packet) ---"
  wait_for_query "$Q_ZW" "zero-window"

  if [ -n "$DNS_LOG_GROUP" ]; then
    echo ""; echo "=== ROUTE 53 RESOLVER QUERY LOGS: NXDOMAIN AND SERVFAIL ==="
    Q_DNS=$(aws logs start-query \
      --log-group-name "$DNS_LOG_GROUP" \
      --start-time "$START_MS" --end-time "$END_MS" \
      --query-string 'fields @timestamp, query_name, rcode, srcids.instance, vpc_id
        | filter rcode = "NXDOMAIN" or rcode = "SERVFAIL"
        | stats count(*) as errorCount by query_name, rcode, srcids.instance
        | sort errorCount desc | limit 50' \
      --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q_DNS=""
    sleep 10
    wait_for_query "$Q_DNS" "dns-nxdomain"
  else
    echo ""; echo "=== ROUTE 53 RESOLVER QUERY LOGS: NOT CONFIGURED ==="
    echo "Pass --dns-log-group /aws/route53resolver/query-logs to include DNS query analysis."
  fi

  echo ""; echo "=== ROUTE 53 RESOLVER CLOUDWATCH METRICS ==="
  START_ISO=$(python3 -c "from datetime import datetime, timedelta, timezone
print((datetime.now(timezone.utc) - timedelta(minutes=${MINUTES_BACK})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
  END_ISO=$(python3 -c "from datetime import datetime, timezone
print(datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))")

  ENDPOINT_IDS=$(aws route53resolver list-resolver-endpoints \
    --region "$REGION" --query 'ResolverEndpoints[].Id' --output text 2>/dev/null || true)

  if [ -n "$ENDPOINT_IDS" ]; then
    for EP in $ENDPOINT_IDS; do
      echo "--- Resolver endpoint: $EP ---"
      for METRIC in NxDomainQueries ServFailQueries TimeoutQueries P90ResponseTime; do
        aws cloudwatch get-metric-statistics \
          --namespace AWS/Route53Resolver \
          --metric-name "$METRIC" \
          --dimensions Name=EndpointId,Value="$EP" \
          --start-time "$START_ISO" --end-time "$END_ISO" \
          --period 300 --statistics Sum Average --region "$REGION" \
          --query 'sort_by(Datapoints,&Timestamp)[*].[Timestamp,Sum,Average]' \
          --output text 2>/dev/null | \
          awk -v m="$METRIC" '{printf "  %s: %s sum=%s avg=%s\n", m, $1, $2, $3}' | tail -12
      done
    done
  else
    echo "No Route 53 Resolver endpoints found in $REGION"
  fi

fi  # SKIP_FLOWLOGS

# ── Load balancers ────────────────────────────────────────────────────────────

if [ "$SKIP_LB" = "false" ]; then

  echo ""; echo "=== ALL LOAD BALANCERS ==="
  aws elbv2 describe-load-balancers \
    --region "$REGION" \
    --query 'LoadBalancers[].{Name:LoadBalancerName,Type:Type,State:State.Code,DNS:DNSName,AZ:AvailabilityZones}' \
    --output json

  echo ""; echo "=== TARGET GROUP HEALTH ==="
  TG_ARNS=$(aws elbv2 describe-target-groups \
    --region "$REGION" --query 'TargetGroups[].TargetGroupArn' --output text)

  for TG_ARN in $TG_ARNS; do
    TG_NAME=$(aws elbv2 describe-target-groups \
      --target-group-arns "$TG_ARN" --region "$REGION" \
      --query 'TargetGroups[0].TargetGroupName' --output text)
    echo ""; echo "--- Target group: $TG_NAME ---"
    aws elbv2 describe-target-health \
      --target-group-arn "$TG_ARN" --region "$REGION" --output json
  done

  echo ""; echo "=== LOAD BALANCER CLOUDWATCH METRICS ==="
  METRIC_START=$(python3 -c "from datetime import datetime, timedelta, timezone
print((datetime.now(timezone.utc) - timedelta(minutes=${MINUTES_BACK})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
  METRIC_END=$(python3 -c "from datetime import datetime, timezone
print(datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ'))")

  LB_ARNS=$(aws elbv2 describe-load-balancers \
    --region "$REGION" --query 'LoadBalancers[].LoadBalancerArn' --output text)

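  for LB_ARN in $LB_ARNS; do
    # CloudWatch dimension value, e.g. app/my-lb/50dc6c495c0c9188
    LB_SUFFIX=$(echo "$LB_ARN" | awk -F':loadbalancer/' '{print $2}')
    LB_NAME=$(echo "$LB_SUFFIX" | cut -d/ -f2)
    # Choose the namespace from the load balancer type instead of trying both
    case "$LB_SUFFIX" in
      app/*) NS="AWS/ApplicationELB" ;;
      net/*) NS="AWS/NetworkELB" ;;
      *)     continue ;;
    esac
    # UnHealthyHostCount is published per target group, so both dimensions are needed
    LB_TG_ARNS=$(aws elbv2 describe-target-groups \
      --load-balancer-arn "$LB_ARN" --region "$REGION" \
      --query 'TargetGroups[].TargetGroupArn' --output text 2>/dev/null || true)
    for LB_TG_ARN in $LB_TG_ARNS; do
      TG_SUFFIX="targetgroup/${LB_TG_ARN#*:targetgroup/}"
      echo ""; echo "--- UnHealthyHostCount: $LB_NAME / $TG_SUFFIX ---"
      aws cloudwatch get-metric-statistics \
        --namespace "$NS" \
        --metric-name UnHealthyHostCount \
        --dimensions Name=LoadBalancer,Value="$LB_SUFFIX" Name=TargetGroup,Value="$TG_SUFFIX" \
        --start-time "$METRIC_START" --end-time "$METRIC_END" \
        --period 60 --statistics Maximum --region "$REGION" \
        --query 'sort_by(Datapoints,&Timestamp)[*].[Timestamp,Maximum]' \
        --output text 2>/dev/null | \
        awk '{printf "  UnHealthyHostCount: %s = %s\n", $1, $2}' | tail -15 || true
    done
  done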

fi  # SKIP_LB

}  # end collect_evidence()

if [ "$COLLECT_ONLY" = "true" ]; then
  collect_evidence
else
  collect_evidence | ./bedrock-ask.sh "$ANALYSIS_QUESTION"
fi
EOF
chmod +x ./diag-network.sh

6. Kubernetes and Container Diagnostics

For EKS environments, the diagnostic surface is wider because you are looking at both the AWS control plane and the Kubernetes data plane. The following script covers node and pod state, CoreDNS health and configuration, and heterogeneous DNS resolution in a single pass.

There is a class of Kubernetes DNS failure that is particularly difficult to diagnose because it affects only some queries, affects only some pods intermittently, and produces errors in applications that look nothing like DNS problems. The most common source is the ndots:5 setting that Kubernetes writes into every pod’s /etc/resolv.conf by default. When a pod queries api.stripe.com, the name has fewer than five dots, so the pod’s resolver library (not the kernel) first appends each search domain in sequence before sending the bare name. With four search domains, as is typical for an EKS pod, that generates five DNS queries instead of one. On a busy cluster this can saturate CoreDNS long before the cluster appears busy by any other metric.
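
The search expansion described above can be sketched in a few lines. This is a simplified model of how a glibc-style stub resolver orders its attempts, not a reimplementation of any particular resolver. The function name candidate_queries and the example search list are assumptions for illustration; real search lists depend on the pod's namespace and the node's DHCP options.

```python
# Illustrative search list for a pod in a hypothetical "payments" namespace
EXAMPLE_SEARCH = [
    "payments.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ap-southeast-1.compute.internal",
]


def candidate_queries(name, search_domains, ndots=5):
    """Return the names a stub resolver would try, in order, until one resolves."""
    if name.endswith("."):
        return [name]                    # trailing dot: fully qualified, no expansion
    expanded = [f"{name}.{domain}." for domain in search_domains]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + expanded     # enough dots: try the name as given first
    return expanded + [absolute]         # too few dots: walk the search list first


if __name__ == "__main__":
    for query in candidate_queries("api.stripe.com", EXAMPLE_SEARCH):
        print(query)
```

For an external name every search-list attempt fails, so all candidates are actually sent, and resolvers that look up A and AAAA records separately can double the packet count again. Appending a trailing dot in application config, or lowering ndots for workloads that mostly call external names, collapses the list to a single query.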

A second dangerous failure mode occurs when the CoreDNS Corefile forwards queries for internal AWS resources to a public upstream resolver rather than to the VPC resolver. Private hosted zone records are not visible from the public internet and return NXDOMAIN from any resolver outside the VPC. When this misconfiguration exists, internal service names resolve correctly from the EC2 node itself but silently fail from inside pods, producing the appearance of a network connectivity problem rather than a DNS problem.
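
A quick way to spot this misconfiguration is to scan the Corefile for forward directives that name a public IP address. The sketch below is a rough illustration and assumes the Corefile text has already been extracted from the coredns ConfigMap. The helper names forward_targets and public_upstreams are hypothetical, and the parsing is deliberately simple rather than a full Corefile parser.

```python
import ipaddress


def forward_targets(corefile):
    """Yield (zone, upstream) for each upstream listed on a forward line."""
    for line in corefile.splitlines():
        tokens = line.split()
        if len(tokens) >= 3 and tokens[0] == "forward":
            zone = tokens[1]
            for upstream in tokens[2:]:
                if upstream == "{":      # start of an options block, not an upstream
                    break
                yield zone, upstream


def public_upstreams(corefile):
    """Return forward upstreams that are literal, publicly routable IP addresses."""
    flagged = []
    for zone, upstream in forward_targets(corefile):
        host = upstream.split("://")[-1]         # drop a tls:// or dns:// scheme
        if host.count(":") == 1:
            host = host.rsplit(":", 1)[0]        # drop :port on IPv4 upstreams
        try:
            address = ipaddress.ip_address(host)
        except ValueError:
            continue                             # a file path such as /etc/resolv.conf
        if address.is_global:
            flagged.append((zone, upstream))
    return flagged
```

An empty result does not prove the forwarding path is correct, because a file-based upstream such as /etc/resolv.conf can itself point at a public resolver. A non-empty result for the root zone is a strong hint that private hosted zone names will fail from inside pods.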

A third scenario involves the VPC DHCP options being changed after the cluster was created. The Corefile’s forward . /etc/resolv.conf directive reads the node’s resolver configuration at CoreDNS startup, not continuously, so a DHCP change takes effect on new CoreDNS pods but leaves running pods using the stale configuration until they are restarted.
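
Checking for this third scenario comes down to comparing timestamps. As a minimal sketch, assume the CoreDNS pod start times have been read from each pod's status.startTime and the time of the DHCP options change has been found separately, for example in CloudTrail. The function names and the input shape below are hypothetical.

```python
from datetime import datetime, timezone


def parse_utc(timestamp):
    """Parse a timestamp such as 2026-06-13T02:14:00Z into an aware datetime."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def stale_coredns_pods(pod_start_times, dhcp_changed_at):
    """Return the names of pods that started before the DHCP options change.

    pod_start_times maps pod name to its start time string (assumed input shape).
    """
    cutoff = parse_utc(dhcp_changed_at)
    return sorted(
        name
        for name, started in pod_start_times.items()
        if parse_utc(started) < cutoff
    )
```

Any pod the function returns is a candidate for still forwarding to the old resolver, and a mix of stale and fresh pods would explain why only some lookups fail.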

cat > ./diag-kubernetes.sh << 'EOF'
#!/bin/bash
# diag-kubernetes.sh
# Collects and analyses all Kubernetes and container evidence in a single pass:
#   - Node status, pressure conditions, and resource utilisation
#   - Pod state: CrashLoopBackOff, ImagePullBackOff, OOMKilled, high restart counts
#   - CoreDNS pod health, Corefile, logs, and endpoint registration
#   - DNS path analysis: ndots amplification, private zone routing, DHCP misalignment
#
# Usage:
#   ./diag-kubernetes.sh [--context <ctx>] [--namespace <ns>]
#                        [--skip-coredns] [--skip-dns-paths] [--collect-only]
#
# Examples:
#   ./diag-kubernetes.sh
#   ./diag-kubernetes.sh --context prod-diagnostics-my-cluster
#   ./diag-kubernetes.sh --namespace payments --skip-dns-paths
#   ./diag-kubernetes.sh --collect-only > evidence.txt   # save without calling Bedrock

set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

CONTEXT="${K8S_CONTEXT:-}"
NAMESPACE=""
SKIP_COREDNS=false
SKIP_DNS_PATHS=false
COLLECT_ONLY=false

while [[ $# -gt 0 ]]; do
  case "$1" in
    --context)        CONTEXT="$2";         shift 2 ;;
    --namespace)      NAMESPACE="$2";       shift 2 ;;
    --skip-coredns)   SKIP_COREDNS=true;    shift ;;
    --skip-dns-paths) SKIP_DNS_PATHS=true;  shift ;;
    --collect-only)   COLLECT_ONLY=true;    shift ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
done

K8S_FLAGS=""
[ -n "$CONTEXT" ] && K8S_FLAGS="--context=$CONTEXT"

NS_FLAG="--all-namespaces"
[ -n "$NAMESPACE" ] && NS_FLAG="-n $NAMESPACE"

ANALYSIS_QUESTION="Analyse this combined Kubernetes evidence for a live production incident.

NODE AND POD HEALTH: identify nodes under memory or disk pressure that may be evicting pods; pods in CrashLoopBackOff and their exit codes or termination reasons (exit code 137 means the container received SIGKILL, most often from the OOM killer; code 1 is a generic application error; code 2 conventionally indicates misuse of a shell builtin or a usage error); ImagePullBackOff pods indicating registry authentication failure or a missing image tag; OOMKilled events showing containers hitting memory limits; Warning events that preceded the incident. Pay attention to restart patterns: a pod that restarts repeatedly with a specific exit code tells a different story than one failing a liveness probe.

COREDNS: DNS failures manifest as connection timeouts, unknown host errors, or intermittent failures that appear random at the application layer. Identify: CoreDNS pods not running or restarting; SERVFAIL or NXDOMAIN errors in logs indicating upstream resolver failures or misconfigured forwarders; high CPU or memory suggesting query volume saturation; endpoints missing from the kube-dns service; ConfigMap changes breaking forwarding rules; under-provisioned CoreDNS replica count relative to cluster size.

DNS PATH ANALYSIS: identify heterogeneous DNS resolution conditions where some queries succeed and others fail. Specifically: whether VPC has enableDnsSupport and enableDnsHostnames; whether private Route 53 zones are associated with the pod's VPC (a zone that exists but is not associated returns NXDOMAIN silently); whether the Corefile forward directive hardcodes an IP that may have changed after a DHCP options update; whether ndots:5 is generating 5 DNS queries per external call and saturating CoreDNS; whether internal RDS or ElastiCache endpoints are being forwarded to a public upstream resolver rather than the VPC resolver. The last scenario produces application connection failures that look like network problems rather than DNS problems and is the most dangerous to miss."

collect_evidence() {

# ── Pre-check ─────────────────────────────────────────────────────────────────

if ! kubectl $K8S_FLAGS cluster-info --request-timeout=10s &>/dev/null; then
  echo "ERROR: Cannot reach Kubernetes API server." >&2
  echo "Check: kubeconfig is configured, context '$CONTEXT' is set, VPN/network is up." >&2
  echo "Run: aws eks update-kubeconfig --name <cluster> --region $REGION" >&2
  exit 1
fi
echo "Cluster connectivity: OK"

# ── Node state ────────────────────────────────────────────────────────────────

echo ""; echo "=== NODE STATUS ==="
kubectl $K8S_FLAGS get nodes -o wide

echo ""; echo "=== NODE RESOURCE PRESSURE ==="
kubectl $K8S_FLAGS describe nodes | \
  grep -A5 "Conditions:" | \
  grep -E "(MemoryPressure|DiskPressure|PIDPressure|Ready|NotReady)"

echo ""; echo "=== NODE RESOURCE UTILISATION ==="
kubectl $K8S_FLAGS top nodes 2>/dev/null || echo "Metrics server not available"

# ── Pod state ─────────────────────────────────────────────────────────────────

echo ""; echo "=== PODS NOT RUNNING ==="
kubectl $K8S_FLAGS get pods $NS_FLAG \
  --field-selector='status.phase!=Running' -o wide 2>/dev/null | head -100

echo ""; echo "=== PODS WITH HIGH RESTART COUNTS ==="
kubectl $K8S_FLAGS get pods $NS_FLAG -o json | python3 -c "
import json, sys
data = json.load(sys.stdin)
high = []
for pod in data.get('items', []):
  ns   = pod['metadata']['namespace']
  name = pod['metadata']['name']
  for c in pod.get('status', {}).get('containerStatuses', []):
    r = c.get('restartCount', 0)
    if r > 3:
      high.append({
        'namespace': ns, 'pod': name, 'container': c['name'], 'restarts': r,
        'state': list(c.get('state', {}).keys()),
        'last_reason': c.get('lastState', {}).get('terminated', {}).get('reason', 'unknown'),
        'exit_code':   c.get('lastState', {}).get('terminated', {}).get('exitCode', 'unknown')
      })
high.sort(key=lambda x: x['restarts'], reverse=True)
for p in high[:30]:
  print(json.dumps(p))
"

echo ""; echo "=== CRASHLOOPBACKOFF PODS ==="
kubectl $K8S_FLAGS get pods $NS_FLAG | grep -i "CrashLoopBackOff" || \
  echo "No CrashLoopBackOff pods"

echo ""; echo "=== IMAGEPULLBACKOFF PODS ==="
kubectl $K8S_FLAGS get pods $NS_FLAG | grep -i "ImagePullBackOff\|ErrImagePull" || \
  echo "No ImagePullBackOff pods"

echo ""; echo "=== OOMKILLED EVENTS (recent) ==="
kubectl $K8S_FLAGS get events $NS_FLAG --sort-by='.lastTimestamp' | \
  grep -i "OOMKill\|oom\|killed" | tail -30 || echo "No OOMKill events"

echo ""; echo "=== ALL WARNING EVENTS (recent) ==="
kubectl $K8S_FLAGS get events $NS_FLAG \
  --field-selector='type=Warning' --sort-by='.lastTimestamp' | tail -50

echo ""; echo "=== POD RESOURCE UTILISATION ==="
kubectl $K8S_FLAGS top pods $NS_FLAG --sort-by=memory 2>/dev/null | head -30 || \
  echo "Metrics server not available"

# ── CoreDNS ───────────────────────────────────────────────────────────────────

if [ "$SKIP_COREDNS" = "false" ]; then

  echo ""; echo "=== COREDNS POD STATUS ==="
  kubectl $K8S_FLAGS get pods -n kube-system -l k8s-app=kube-dns -o wide

  echo ""; echo "=== COREDNS RESOURCE USAGE ==="
  kubectl $K8S_FLAGS top pods -n kube-system -l k8s-app=kube-dns 2>/dev/null || \
    echo "Metrics server unavailable"

  echo ""; echo "=== COREDNS CONFIGMAP ==="
  kubectl $K8S_FLAGS get configmap coredns -n kube-system -o yaml

  echo ""; echo "=== COREDNS LOGS (errors and warnings) ==="
  COREDNS_PODS=$(kubectl $K8S_FLAGS get pods -n kube-system \
    -l k8s-app=kube-dns -o jsonpath='{.items[*].metadata.name}')
  for POD in $COREDNS_PODS; do
    echo "--- Pod: $POD ---"
    kubectl $K8S_FLAGS logs "$POD" -n kube-system --tail=200 2>/dev/null | \
      grep -iE "error|SERVFAIL|refused|timeout|panic|NXDOMAIN|i/o timeout|no route" | \
      tail -50 || \
    kubectl $K8S_FLAGS logs "$POD" -n kube-system --tail=100 2>/dev/null || \
      echo "Could not retrieve logs for $POD"
  done

  echo ""; echo "=== KUBE-DNS SERVICE AND ENDPOINTS ==="
  kubectl $K8S_FLAGS get svc kube-dns -n kube-system -o yaml
  kubectl $K8S_FLAGS get endpoints kube-dns -n kube-system -o yaml

  echo ""; echo "=== COREDNS DEPLOYMENT REPLICA STATE ==="
  kubectl $K8S_FLAGS get deployment coredns -n kube-system -o yaml | \
    grep -A10 "replicas\|status"

  echo ""; echo "=== COREDNS HPA ==="
  kubectl $K8S_FLAGS get hpa -n kube-system 2>/dev/null || echo "No HPA in kube-system"

  echo ""; echo "=== DNS RESOLUTION TEST FROM CLUSTER ==="
  TEST_POD="dns-test-$(date +%s)"
  kubectl $K8S_FLAGS run "$TEST_POD" --image=busybox:1.28 --rm --restart=Never -it \
    -- sh -c '
    echo "--- internal: kubernetes.default ---"
    nslookup kubernetes.default
    echo ""
    echo "--- external: s3.amazonaws.com ---"
    nslookup s3.amazonaws.com
    echo ""
    echo "--- timing unqualified vs trailing-dot (search expansion check) ---"
    time nslookup s3.amazonaws.com
    echo "---"
    time nslookup s3.amazonaws.com.
  ' 2>/dev/null || echo "Could not run DNS test pod"

  echo ""; echo "=== COREDNS PROMETHEUS METRICS SNAPSHOT ==="
  for POD in $COREDNS_PODS; do
    echo "--- $POD ---"
    kubectl $K8S_FLAGS exec "$POD" -n kube-system -- \
      wget -qO- http://localhost:9153/metrics 2>/dev/null | \
      grep -E "^coredns_dns_(requests_total|responses_total|forward_requests_total|forward_healthcheck_failures_total)" | \
      head -30 || echo "Metrics endpoint not reachable inside pod"
  done

fi  # SKIP_COREDNS

# ── DNS path analysis ─────────────────────────────────────────────────────────

if [ "$SKIP_DNS_PATHS" = "false" ]; then

  echo ""; echo "=== VPC DNS SETTINGS ==="
  VPC_IDS=$(aws ec2 describe-vpcs --region "$REGION" \
    --query 'Vpcs[].VpcId' --output text)
  for VPC in $VPC_IDS; do
    echo "--- VPC: $VPC ---"
    aws ec2 describe-vpc-attribute --vpc-id "$VPC" \
      --attribute enableDnsSupport --region "$REGION" --output json 2>/dev/null
    aws ec2 describe-vpc-attribute --vpc-id "$VPC" \
      --attribute enableDnsHostnames --region "$REGION" --output json 2>/dev/null
    DHCP_ID=$(aws ec2 describe-vpcs --vpc-ids "$VPC" --region "$REGION" \
      --query 'Vpcs[0].DhcpOptionsId' --output text)
    echo "  DHCP options for $VPC:"
    aws ec2 describe-dhcp-options --dhcp-options-ids "$DHCP_ID" \
      --region "$REGION" \
      --query 'DhcpOptions[0].DhcpConfigurations' \
      --output json 2>/dev/null
  done

  echo ""; echo "=== ROUTE 53 PRIVATE HOSTED ZONES AND VPC ASSOCIATIONS ==="
  aws route53 list-hosted-zones \
    --query 'HostedZones[?Config.PrivateZone==`true`].{Name:Name,ID:Id,RecordCount:ResourceRecordSetCount}' \
    --output json

  PRIVATE_ZONE_IDS=$(aws route53 list-hosted-zones \
    --query 'HostedZones[?Config.PrivateZone==`true`].Id' --output text)
  for ZONE_ID in $PRIVATE_ZONE_IDS; do
    SHORT_ID=$(echo "$ZONE_ID" | sed 's|/hostedzone/||')
    echo "--- Zone $SHORT_ID VPC associations ---"
    aws route53 get-hosted-zone --id "$SHORT_ID" \
      --query 'VPCs' --output json 2>/dev/null || echo "Could not retrieve zone details"
  done

  echo ""; echo "=== ROUTE 53 RESOLVER RULES AND ENDPOINTS ==="
  aws route53resolver list-resolver-rules \
    --region "$REGION" --output json 2>/dev/null || echo "No resolver rules"
  aws route53resolver list-resolver-endpoints \
    --region "$REGION" --output json 2>/dev/null || echo "No resolver endpoints"

  echo ""; echo "=== POD resolv.conf SAMPLES (ndots and search domains) ==="
  for NS in $(kubectl $K8S_FLAGS get namespaces \
      -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | \
      grep -v "kube-\|cert-\|external-dns" | head -5); do
    SAMPLE_POD=$(kubectl $K8S_FLAGS get pods -n "$NS" \
      -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || true)
    if [ -n "$SAMPLE_POD" ]; then
      echo "--- $NS/$SAMPLE_POD ---"
      kubectl $K8S_FLAGS exec "$SAMPLE_POD" -n "$NS" -- \
        cat /etc/resolv.conf 2>/dev/null || echo "  (exec not available)"
    fi
  done

  echo ""; echo "=== DNS PATH COMPARISON TEST ==="
  DNS_TEST_POD="dns-path-test-$(date +%s)"
  kubectl $K8S_FLAGS run "$DNS_TEST_POD" -n default \
    --image=busybox:1.28 --rm --restart=Never -it -- sh -c '
    echo "--- ndots setting ---"
    grep ndots /etc/resolv.conf || echo "ndots not set (resolver default is 1; Kubernetes ClusterFirst normally writes ndots:5)"
    echo ""
    echo "--- search domains ---"
    grep search /etc/resolv.conf
    echo ""
    echo "--- search expansion: unqualified vs trailing-dot timing ---"
    echo "Unqualified (subject to search expansion across all domains):"
    time nslookup s3.amazonaws.com
    echo ""
    echo "Trailing dot (bypasses expansion, sends bare query directly):"
    time nslookup s3.amazonaws.com.
    echo ""
    echo "--- SRV record for kube-dns ---"
    nslookup -type=SRV _dns._udp.kube-dns.kube-system.svc.cluster.local || \
      echo "SRV lookup failed"
  ' 2>/dev/null || echo "Could not run DNS path test pod"

fi  # SKIP_DNS_PATHS

}  # end collect_evidence()

if [ "$COLLECT_ONLY" = "true" ]; then
  collect_evidence
else
  collect_evidence | ./bedrock-ask.sh "$ANALYSIS_QUESTION"
fi
EOF
chmod +x ./diag-kubernetes.sh

7. Database Diagnostics

Database issues during a production incident fall into three families: availability and connectivity failures, performance failures from slow queries or lock contention, and structural failures from missing indexes or bad execution plans. The following script covers all three in a single pass using Performance Insights, slow query logs via Logs Insights, and Aurora PostgreSQL Query Plan Management (QPM) state.

There is a category of Aurora PostgreSQL incident that is particularly insidious: query plan regression. The PostgreSQL planner switches from an efficient plan to an inefficient one, usually in response to a statistics update, a schema change, a parameter group modification, or a minor Aurora engine version upgrade. The new plan may involve a sequential scan where an index scan previously existed, or a hash join with a large in-memory hash table where a nested loop was previously used. Each of these consumes significantly more memory per connection, and on clusters with high connection counts the aggregate effect is rapid memory exhaustion followed by OOM events that look like instance sizing problems rather than query plan problems.
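
To make the aggregate effect concrete, here is a minimal sketch of the worst-case arithmetic. It assumes every connection runs the regressed plan at the same moment and relies on the fact that PostgreSQL applies work_mem per sort or hash node rather than per query. The function name and the figures (800 connections, 16MB work_mem, one versus three memory-hungry plan nodes) are illustrative assumptions, not measurements from a real cluster.

```shell
#!/bin/bash
# Upper bound on working memory if every connection executes a plan with
# N sort/hash nodes at the same time. work_mem is applied per node.
# All values passed below are hypothetical.
worst_case_work_mem_mb() {
  local connections="$1" work_mem_mb="$2" nodes_per_query="$3"
  echo $(( connections * work_mem_mb * nodes_per_query ))
}

# Same connection count and work_mem; only the plan shape changes.
echo "before regression: $(worst_case_work_mem_mb 800 16 1) MB"
echo "after regression:  $(worst_case_work_mem_mb 800 16 3) MB"
```

Nothing about traffic or instance size changed between the two lines; only the number of memory-consuming plan nodes did, which is why this failure is easy to misread as a sizing problem.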

Aurora PostgreSQL includes the apg_plan_mgmt extension, which provides Query Plan Management. When enabled, QPM captures execution plans for qualifying SQL statements, allows you to approve specific plans, and enforces only approved plans regardless of what the planner would choose based on current statistics.

Pass a specific instance ID to run Performance Insights, slow query analysis, and QPM collection in addition to the account-wide overview. Set DB_ENDPOINT to include direct psql queries for QPM plan state.

cat > ./diag-rds.sh << 'EOF'
#!/bin/bash
# diag-rds.sh
# Collects and analyses all database evidence in a single pass:
#   RDS/Aurora state, events, CloudWatch metrics
#   Performance Insights: top SQL by DB load and top wait events
#   Slow query log analysis via Logs Insights
#   Aurora QPM plan state and memory growth indicators
# Usage:
#   ./diag-rds.sh                          # overview of all instances
#   ./diag-rds.sh <instance-id>            # PI + slow query + QPM for one instance
#   DB_ENDPOINT=<ep> ./diag-rds.sh <id>   # includes direct psql QPM queries
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
INSTANCE_ID="${1:-}"
DB_ENDPOINT="${DB_ENDPOINT:-}"
DB_USER="${DB_USER:-postgres}"
DB_NAME="${DB_NAME:-postgres}"
DB_PORT="${DB_PORT:-5432}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
MINUTES_BACK=$(( ANALYSIS_HOURS * 60 ))

START=$(python3 -c "from datetime import datetime,timedelta; print((datetime.utcnow()-timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")
START_MS=$(python3 -c "import time; print(int((time.time()-${MINUTES_BACK}*60)*1000))")
END_MS=$(python3 -c "import time; print(int(time.time()*1000))")

{

echo "=== RDS INSTANCE STATUS ==="
aws rds describe-db-instances --region "$REGION" \
  --query 'DBInstances[].{ID:DBInstanceIdentifier,Class:DBInstanceClass,Engine:Engine,EngineVersion:EngineVersion,Status:DBInstanceStatus,MultiAZ:MultiAZ,PerformanceInsights:PerformanceInsightsEnabled,MonitoringInterval:MonitoringInterval,PendingModified:PendingModifiedValues}' \
  --output json

echo "=== AURORA CLUSTER STATUS ==="
aws rds describe-db-clusters --region "$REGION" \
  --query 'DBClusters[].{ID:DBClusterIdentifier,Engine:Engine,Status:Status,Members:DBClusterMembers,ReaderEndpoint:ReaderEndpoint,WriterEndpoint:Endpoint,MultiAZ:MultiAZ}' \
  --output json

echo "=== RDS EVENTS (last ${ANALYSIS_HOURS}h) ==="
aws rds describe-events --start-time "$START" --region "$REGION" --output json

echo "=== RDS CLOUDWATCH METRICS ==="
INSTANCES="${INSTANCE_ID:-$(aws rds describe-db-instances --region "$REGION" --query 'DBInstances[].DBInstanceIdentifier' --output text)}"
for INSTANCE in $INSTANCES; do
  echo "--- Instance: $INSTANCE ---"
  for METRIC in CPUUtilization FreeableMemory DatabaseConnections ReadLatency WriteLatency ReadIOPS WriteIOPS FreeStorageSpace ReplicaLag; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS --metric-name "$METRIC" \
      --dimensions Name=DBInstanceIdentifier,Value="$INSTANCE" \
      --start-time "$START" --end-time "$END" --period 300 \
      --statistics Average Maximum --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum]' \
      --output text 2>/dev/null || echo "N/A N/A")
    echo "  $METRIC: avg=$(echo $VALUE|awk '{print $1}') max=$(echo $VALUE|awk '{print $2}')"
  done
done

if [ -n "$INSTANCE_ID" ]; then
  DBI_RESOURCE_ID=$(aws rds describe-db-instances \
    --db-instance-identifier "$INSTANCE_ID" --region "$REGION" \
    --query 'DBInstances[0].DbiResourceId' --output text 2>/dev/null) || DBI_RESOURCE_ID=""

  if [ -n "$DBI_RESOURCE_ID" ]; then
    echo "=== PERFORMANCE INSIGHTS: TOP SQL BY DB LOAD ==="
    aws pi get-resource-metrics \
      --service-type RDS --identifier "db:${DBI_RESOURCE_ID}" \
      --start-time "$START" --end-time "$END" --period-in-seconds 300 \
      --metric-queries '[{"Metric":"db.load.avg","GroupBy":{"Group":"db.sql","Dimensions":["db.sql.statement"],"Limit":10}}]' \
      --region "$REGION" --output json 2>/dev/null || echo "Performance Insights not enabled"

    echo "=== PERFORMANCE INSIGHTS: TOP WAIT EVENTS ==="
    aws pi get-resource-metrics \
      --service-type RDS --identifier "db:${DBI_RESOURCE_ID}" \
      --start-time "$START" --end-time "$END" --period-in-seconds 300 \
      --metric-queries '[{"Metric":"db.load.avg","GroupBy":{"Group":"db.wait_event","Limit":10}}]' \
      --region "$REGION" --output json 2>/dev/null || echo "Performance Insights unavailable"
  fi

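  # The slowquery log group and the Query_time header parsed below are the
  # MySQL/MariaDB slow log format. PostgreSQL engines publish to a .../postgresql
  # log group instead, so expect the fallback branch for those instances.
  LOG_GROUP="/aws/rds/instance/${INSTANCE_ID}/slowquery"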
  echo "=== SLOW QUERY LOG (last ${MINUTES_BACK}min) ==="
  SQ_QID=$(aws logs start-query \
    --log-group-name "$LOG_GROUP" \
    --start-time "$START_MS" --end-time "$END_MS" \
    --query-string 'fields @timestamp, @message
      | filter @message like /Query_time/
      | parse @message "# Query_time: * Lock_time: * Rows_sent: * Rows_examined: *" as qt,lt,rs,re
      | sort @timestamp desc | limit 50' \
    --region "$REGION" --query 'queryId' --output text 2>/dev/null) || SQ_QID=""

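  if [ -n "$SQ_QID" ]; then
    # Logs Insights queries run asynchronously: poll until the query leaves
    # Scheduled/Running (up to ~30s) rather than relying on a fixed sleep.
    for _ in 1 2 3 4 5 6 7 8 9 10; do
      SQ_STATUS=$(aws logs get-query-results --query-id "$SQ_QID" --region "$REGION" \
        --query 'status' --output text 2>/dev/null || echo "Unknown")
      [ "$SQ_STATUS" = "Running" ] || [ "$SQ_STATUS" = "Scheduled" ] || break
      sleep 3
    done
    aws logs get-query-results --query-id "$SQ_QID" --region "$REGION" --output json 2>/dev/null || echo "Results unavailable"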
  else
    echo "Slow query log '$LOG_GROUP' not found. Available log files:"
    aws rds describe-db-log-files --db-instance-identifier "$INSTANCE_ID" --region "$REGION" --output json 2>/dev/null || true
  fi

  echo "=== AURORA QPM: PARAMETER GROUP SETTINGS ==="
  PARAM_GROUP=$(aws rds describe-db-instances \
    --db-instance-identifier "$INSTANCE_ID" --region "$REGION" \
    --query 'DBInstances[0].DBParameterGroups[0].DBParameterGroupName' --output text 2>/dev/null) || PARAM_GROUP=""

  if [ -n "$PARAM_GROUP" ] && [ "$PARAM_GROUP" != "None" ]; then
    echo "Parameter group: $PARAM_GROUP"
    for PARAM in rds.enable_plan_management apg_plan_mgmt.capture_plan_baselines \
                  apg_plan_mgmt.use_plan_baselines apg_plan_mgmt.max_plans \
                  work_mem maintenance_work_mem effective_cache_size \
                  shared_buffers max_connections random_page_cost; do
      VALUE=$(aws rds describe-db-parameters \
        --db-parameter-group-name "$PARAM_GROUP" --region "$REGION" \
        --query "Parameters[?ParameterName=='${PARAM}'].{Value:ParameterValue,Source:Source}" \
        --output json 2>/dev/null | python3 -c "
import json,sys; d=json.load(sys.stdin)
print(json.dumps(d[0]) if d else 'not set')" 2>/dev/null || echo "not found")
      echo "  $PARAM: $VALUE"
    done
  fi

  echo "=== AURORA MEMORY/CPU (last ${ANALYSIS_HOURS}h) ==="
  for METRIC in CPUUtilization FreeableMemory SwapUsage DatabaseConnections; do
    echo "--- $METRIC ---"
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS --metric-name "$METRIC" \
      --dimensions Name=DBInstanceIdentifier,Value="$INSTANCE_ID" \
      --start-time "$START" --end-time "$END" --period 300 \
      --statistics Average Maximum --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[*].[Timestamp,Average,Maximum]' \
      --output text 2>/dev/null | tail -24
  done

  echo "=== RDS EVENTS: RESTARTS/OOM/FAILOVER ==="
  aws rds describe-events \
    --source-identifier "$INSTANCE_ID" --source-type db-instance \
    --start-time "$START" --region "$REGION" \
    --query 'Events[?contains(Message,`restart`)||contains(Message,`recovery`)||contains(Message,`failover`)||contains(Message,`OOM`)||contains(Message,`memory`)].{Time:Date,Message:Message}' \
    --output json 2>/dev/null

  if [ -n "$DB_ENDPOINT" ]; then
    echo "=== QPM PLAN STATE (direct psql) ==="
    psql -h "$DB_ENDPOINT" -U "$DB_USER" -d "$DB_NAME" -p "$DB_PORT" \
      --no-password -t -A -F'|' << 'SQLEOF' 2>/dev/null || echo "psql failed - set DB_ENDPOINT and configure .pgpass"
SELECT extname, extversion FROM pg_extension WHERE extname = 'apg_plan_mgmt';
SELECT status, enabled, count(*) as plan_count FROM apg_plan_mgmt.dba_plans GROUP BY status, enabled;
SELECT sql_hash, plan_hash, status, enabled, last_used, first_used, calls,
       CASE WHEN calls > 0 THEN total_plan_time_ms/calls ELSE 0 END as avg_ms
FROM apg_plan_mgmt.dba_plans
WHERE last_used > now()-interval '24 hours' OR first_used > now()-interval '24 hours'
ORDER BY last_used DESC NULLS LAST LIMIT 30;
SELECT sql_hash, plan_hash, status, calls FROM apg_plan_mgmt.dba_plans
WHERE (status='Rejected' OR status='Unapproved') AND calls > 0 ORDER BY calls DESC LIMIT 20;
SELECT name, setting, unit FROM pg_settings
WHERE name IN ('work_mem','maintenance_work_mem','effective_cache_size','max_connections',
               'shared_buffers','random_page_cost','enable_hashjoin','enable_seqscan');
SELECT datname, temp_files, temp_bytes,
       CASE WHEN blks_read+blks_hit>0 THEN round(100.0*blks_hit/(blks_read+blks_hit),2) ELSE 0 END as cache_hit_pct
FROM pg_stat_database WHERE datname NOT IN ('template0','template1') ORDER BY temp_bytes DESC;
SELECT schemaname, relname AS tablename, n_live_tup, n_dead_tup,
       CASE WHEN n_live_tup>0 THEN round(100.0*n_dead_tup/n_live_tup,1) ELSE 0 END as dead_pct,
       last_analyze, last_autoanalyze, last_autovacuum
FROM pg_stat_user_tables WHERE n_dead_tup>10000 ORDER BY n_dead_tup DESC LIMIT 20;
SQLEOF
  else
    echo "DB_ENDPOINT not set. Set: export DB_ENDPOINT=<cluster-writer-endpoint> and configure .pgpass"
  fi
fi

} | ./bedrock-ask.sh "Analyse this RDS and Aurora database evidence for a production incident.

For instance state: identify instances not in available status, recent failover or restart events, CPU or memory saturation, connection counts approaching max_connections, high read or write latency, storage running low, and replica lag on read replicas. Flag any instance without Performance Insights enabled.

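For Performance Insights: the DB load metric represents average active sessions; values sustained above the vCPU count indicate saturation. On Aurora PostgreSQL, Lock:relation and Lock:transactionid waits mean sessions are blocked on locks, IO:DataFileRead means reads are going to storage (often sequential scans or a working set larger than the buffer cache), and IO:XactSync means commit latency on the write path. On MySQL-family engines the equivalent waits appear under synch/ and io/ prefixes such as io/table/sql/handler. Identify top SQL statements by load and correlate them with the dominant wait event.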

For slow queries: high Rows_examined relative to Rows_sent is the signature of a full table scan or under-selective index. High Lock_time indicates lock contention. Recommend specific index candidates based on query WHERE clauses where possible.

For Aurora QPM and memory growth: check whether rds.enable_plan_management=1 and apg_plan_mgmt.use_plan_baselines=On. If not, any statistics update or engine upgrade can silently change plans. With the default work_mem of 4MB, any sort or hash larger than 4MB spills to temporary files; a raised work_mem avoids the spill but is allocated per sort or hash operation, so on clusters with many concurrent connections it can produce memory exhaustion that looks like an instance sizing problem. Plans with status Rejected or Unapproved still accumulating calls mean QPM enforcement is incomplete. A plan whose first_used timestamp matches the incident start is a strong indicator that a plan change contributed to the incident. High temp_bytes in pg_stat_database confirms work_mem spill. Tables with high dead_pct and stale last_analyze have unreliable cardinality estimates, a common trigger for plan regression. If FreeableMemory shows a declining trend that began before the alert fired, rank plan regression or work_mem misconfiguration above increased traffic as hypotheses and state which evidence supports each."
EOF
chmod +x ./diag-rds.sh
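
The analysis prompt in diag-rds.sh treats a high Rows_examined to Rows_sent ratio as the signature of a full table scan. As a rough sketch of that heuristic applied locally, the function below computes the ratio for a single MySQL-format slow log header line. The function name and the sample line are hypothetical, and a Rows_sent of zero is treated as one purely to avoid division by zero.

```shell
#!/bin/bash
# Print Rows_examined / Rows_sent for one MySQL-format slow log header line.
# A Rows_sent of 0 is treated as 1 so the ratio stays defined.
scan_ratio() {
  awk '{
    sent = 0; examined = 0
    for (i = 1; i < NF; i++) {
      if ($i == "Rows_sent:")     sent = $(i + 1)
      if ($i == "Rows_examined:") examined = $(i + 1)
    }
    if (sent == 0) sent = 1
    printf "%d\n", examined / sent
  }' <<< "$1"
}

# Hypothetical sample line, not taken from a real system.
scan_ratio "# Query_time: 12.4 Lock_time: 0.0001 Rows_sent: 10 Rows_examined: 4500000"
```

A ratio in the hundreds of thousands for a query that returns ten rows is the kind of line worth sending to the model first; what counts as high for a given workload is a judgement call, not something this sketch decides.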

8. S3 Diagnostics

S3 failures during a production incident usually fall into three categories: access denial from a changed bucket policy or IAM permission, throttling from an application exceeding the request rate S3 supports for a single prefix, and data integrity issues from a versioning or lifecycle policy change that has removed expected objects.

cat > ./diag-s3.sh << 'EOF'
#!/bin/bash
# diag-s3.sh
# Usage: ./diag-s3.sh [bucket-name-filter]
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
BUCKET_FILTER="${1:-}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"

START=$(python3 -c "from datetime import datetime,timedelta; print((datetime.utcnow()-timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

{
echo "=== ACCOUNT-LEVEL PUBLIC ACCESS BLOCK ==="
aws s3control get-public-access-block \
  --account-id "$(aws sts get-caller-identity --query Account --output text)" \
  --output json 2>/dev/null || echo "Not configured"

echo "=== S3 BUCKET LIST ==="
aws s3api list-buckets --query 'Buckets[].{Name:Name,Created:CreationDate}' --output json

for BUCKET in $(aws s3api list-buckets --query 'Buckets[].Name' --output text); do
  [ -n "$BUCKET_FILTER" ] && [[ "$BUCKET" != *"$BUCKET_FILTER"* ]] && continue
  echo "--- Bucket: $BUCKET ---"
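  # S3 request metrics are opt-in: these return datapoints only if a request
  # metrics configuration with filter ID EntireBucket is enabled on the bucket.
  for METRIC in 4xxErrors 5xxErrors TotalRequestLatency AllRequests BytesDownloaded BytesUploaded; do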
    aws cloudwatch get-metric-statistics \
      --namespace AWS/S3 --metric-name "$METRIC" \
      --dimensions Name=BucketName,Value="$BUCKET" Name=FilterId,Value=EntireBucket \
      --start-time "$START" --end-time "$END" --period 300 --statistics Sum Average \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Sum,Average]' \
      --output text 2>/dev/null | awk -v m="$METRIC" '{printf "  %s: sum=%s avg=%s\n",m,$1,$2}'
  done
  aws s3api get-public-access-block --bucket "$BUCKET" --output json 2>/dev/null || echo "  (public access block not configured)"
  aws s3api get-bucket-versioning --bucket "$BUCKET" --output json 2>/dev/null || echo "  (versioning check failed)"
  aws s3api get-bucket-lifecycle-configuration --bucket "$BUCKET" \
    --query 'Rules[].{ID:ID,Status:Status,Expiration:Expiration,Transitions:Transitions}' \
    --output json 2>/dev/null || echo "  (no lifecycle rules)"
done

} | ./bedrock-ask.sh "Analyse this S3 data during a production incident where applications may be failing to read or write objects. Identify: buckets with high 4xx error rates (access denial, missing keys, or invalid requests; a sudden rise often points to a recent bucket policy or IAM change, or to objects removed by a lifecycle rule), buckets with elevated 5xx error rates (including 503 Slow Down throttling responses), buckets without public access block configured, lifecycle rules with short expiration windows that might have recently deleted objects the application expects to find, versioning disabled on buckets where it should be enabled, and latency patterns in TotalRequestLatency that might indicate a hot prefix receiving a disproportionate share of requests. Metrics shown as None mean request metrics are not enabled for that bucket; report that as missing evidence rather than as a healthy bucket."
EOF
chmod +x ./diag-s3.sh

9. Cache Diagnostics: ElastiCache and DAX

Cache failures are among the most dangerous production incidents because they are frequently invisible until they cascade. A cache that silently degrades has a different failure signature from one that is hard down: hit rates fall slowly, evictions climb, application latency increases, and database connection counts rise as cache misses drive through to RDS or DynamoDB.

There are three distinct cache failure modes to check. The first is memory pressure, where the cache is running out of space and evicting keys. In a session cache or a read-through cache, evictions mean data the application expected to find is not there. If the eviction rate is high enough, the database receives every request as a miss. The second is replication lag, where replica nodes are serving stale data. The third is cluster resharding, where ElastiCache is redistributing slots across shards and write latency rises on the slots being migrated.

For ElastiCache Redis, watch EngineCPUUtilization for CPU, not CPUUtilization. Redis executes commands on a single thread, while CPUUtilization is averaged across every core on the node, so it significantly understates the load on the Redis process itself. A cluster showing 25% CPUUtilization may have its Redis engine thread at 90%.
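
As a rough illustration of why the two metrics diverge, the arithmetic can be sketched as below. The function name and the vCPU counts are assumptions for illustration, and the calculation deliberately ignores CPU consumed by anything other than the engine thread.

```shell
#!/bin/bash
# Node-level CPU percentage contributed by one engine thread on a node with
# N vCPUs. Ignores all other processes, so treat the result as a lower bound.
node_cpu_pct() {
  local engine_pct="$1" vcpus="$2"
  awk -v e="$engine_pct" -v n="$vcpus" 'BEGIN { printf "%.1f\n", e / n }'
}

# Hypothetical 4-vCPU node with the engine thread at 90%.
echo "node-level CPU from engine thread: $(node_cpu_pct 90 4)%"
```

On a four-core node an engine thread close to saturation contributes a little over a fifth of node-level CPU, which is why a node-level alarm threshold can stay quiet while Redis is the bottleneck.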

For DAX, the cache hit rate must stay above 90% for the cluster to provide any material benefit. Below that threshold, the volume of DynamoDB pass-through requests begins to consume cluster resources without reducing DynamoDB load, and the cluster can become a bottleneck rather than an accelerator.
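
DAX publishes hit and miss counts rather than a ready-made hit rate, so the 90% threshold has to be derived. A minimal sketch of that calculation follows; the function name and the sample counts are hypothetical, and the inputs are assumed to be Sum values for the same period.

```shell
#!/bin/bash
# Hit rate as a percentage from hit and miss counts for the same period.
# Prints N/A when there was no traffic, rather than dividing by zero.
hit_rate_pct() {
  local hits="$1" misses="$2"
  awk -v h="$hits" -v m="$misses" 'BEGIN {
    total = h + m
    if (total == 0) { print "N/A"; exit }
    printf "%.1f\n", 100 * h / total
  }'
}

# Hypothetical ItemCacheHits and ItemCacheMisses sums for one period.
echo "item cache hit rate: $(hit_rate_pct 9200 800)%"
```

The same function applies to the query cache by passing QueryCacheHits and QueryCacheMisses; the two rates are worth computing separately because they can diverge.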

cat > ./diag-cache.sh << 'EOF'
#!/bin/bash
# diag-cache.sh
# Covers ElastiCache and DAX in a single pass.
# Note: use EngineCPUUtilization for Redis CPU, not CPUUtilization.
# Redis is single-threaded; CPUUtilization reflects all cores and underrepresents Redis load.
# Usage: ./diag-cache.sh
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"

START=$(python3 -c "from datetime import datetime,timedelta; print((datetime.utcnow()-timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

{
echo "=== ELASTICACHE CLUSTERS ==="
aws elasticache describe-cache-clusters --show-cache-node-info --region "$REGION" \
  --query 'CacheClusters[*].{ID:CacheClusterId,Engine:Engine,EngineVersion:EngineVersion,NodeType:CacheNodeType,Status:CacheClusterStatus,NumNodes:NumCacheNodes,ReplicationGroupId:ReplicationGroupId}' \
  --output json

echo "=== ELASTICACHE REPLICATION GROUPS ==="
aws elasticache describe-replication-groups --region "$REGION" \
  --query 'ReplicationGroups[*].{ID:ReplicationGroupId,Status:Status,ClusterMode:ClusterEnabled,MultiAZ:MultiAZ,AutoFailover:AutomaticFailover,NodeGroups:NodeGroups[*].{ID:NodeGroupId,Status:Status,Slots:Slots,Members:NodeGroupMembers[*].{ID:CacheClusterId,Role:CurrentRole,Status:CurrentStatus}}}' \
  --output json

echo "=== ELASTICACHE EVENTS (last ${ANALYSIS_HOURS}h) ==="
aws elasticache describe-events --region "$REGION" --start-time "$START" \
  --query 'Events[*].{Time:Date,Source:SourceIdentifier,Message:Message}' --output json

echo "=== CLOUDWATCH: ELASTICACHE METRICS ==="
CLUSTER_IDS=$(aws elasticache describe-cache-clusters --region "$REGION" \
  --query 'CacheClusters[].CacheClusterId' --output text 2>/dev/null | tr '\t' '\n') || CLUSTER_IDS=""

for CLUSTER_ID in $CLUSTER_IDS; do
  echo "--- Cluster: $CLUSTER_ID ---"
  for METRIC in CacheHits CacheMisses CacheHitRate Evictions CurrConnections \
                FreeableMemory DatabaseMemoryUsagePercentage MemoryFragmentationRatio \
                EngineCPUUtilization CPUUtilization ReplicationLag SwapUsage; do
    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache --metric-name "$METRIC" \
      --dimensions Name=CacheClusterId,Value="$CLUSTER_ID" \
      --start-time "$START" --end-time "$END" --period 300 \
      --statistics Average Maximum Sum --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum,Sum]' \
      --output text 2>/dev/null || echo "N/A")
    echo "  $METRIC: $DATA"
  done
done

echo "=== REPLICATION LAG AND RESHARDING STATUS ==="
RG_IDS=$(aws elasticache describe-replication-groups --region "$REGION" \
  --query 'ReplicationGroups[].ReplicationGroupId' --output text 2>/dev/null | tr '\t' '\n') || RG_IDS=""

for RG_ID in $RG_IDS; do
  STATUS=$(aws elasticache describe-replication-groups \
    --replication-group-id "$RG_ID" --region "$REGION" \
    --query 'ReplicationGroups[0].Status' --output text 2>/dev/null)
  echo "--- Replication group: $RG_ID status=$STATUS ---"
  [ "$STATUS" = "modifying" ] && echo "  *** RESHARDING IN PROGRESS: write latency elevated on migrating slots"
  REPLICA_IDS=$(aws elasticache describe-replication-groups \
    --replication-group-id "$RG_ID" --region "$REGION" \
    --query 'ReplicationGroups[0].NodeGroups[*].NodeGroupMembers[?CurrentRole==`replica`].CacheClusterId' \
    --output text | tr '\t' '\n')
  for REPLICA_ID in $REPLICA_IDS; do
    LAG=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache --metric-name ReplicationLag \
      --dimensions Name=CacheClusterId,Value="$REPLICA_ID" \
      --start-time "$START" --end-time "$END" --period 300 \
      --statistics Average Maximum --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum]' \
      --output text 2>/dev/null || echo "N/A N/A")
    echo "  $REPLICA_ID: avg_lag=$(echo $LAG|awk '{print $1}')s max_lag=$(echo $LAG|awk '{print $2}')s"
  done
done

echo "=== SLOW LOG CONFIGURATION ==="
for CLUSTER_ID in $CLUSTER_IDS; do
  LOG_CONFIG=$(aws elasticache describe-cache-clusters \
    --cache-cluster-id "$CLUSTER_ID" --region "$REGION" \
    --query 'CacheClusters[0].LogDeliveryConfigurations' --output json 2>/dev/null) || LOG_CONFIG=""
  if [ "$LOG_CONFIG" = "[]" ] || [ -z "$LOG_CONFIG" ]; then
    echo "  $CLUSTER_ID: slow logs NOT configured (cannot identify high-latency commands without them)"
  else
    echo "  $CLUSTER_ID: log delivery configured"
  fi
done

echo "=== DAX CLUSTERS ==="
aws dax describe-clusters --region "$REGION" \
  --query 'Clusters[*].{Name:ClusterName,Status:Status,NodeType:NodeType,TotalNodes:TotalNodes,ActiveNodes:ActiveNodes}' \
  --output json 2>/dev/null || echo "No DAX clusters or insufficient permissions"

echo "=== CLOUDWATCH: DAX METRICS ==="
DAX_NAMES=$(aws dax describe-clusters --region "$REGION" \
  --query 'Clusters[].ClusterName' --output text 2>/dev/null | tr '\t' '\n' || echo "")
for DAX_NAME in $DAX_NAMES; do
  echo "--- DAX cluster: $DAX_NAME ---"
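  # FaultRequestCount is included because the analysis prompt interprets it.
  # DAX publishes cluster-level metrics under the ClusterId dimension.
  for METRIC in ItemCacheHits ItemCacheMisses QueryCacheHits QueryCacheMisses \
                TotalRequestCount ErrorRequestCount FaultRequestCount \
                ThrottledRequestCount CPUUtilization; do
    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/DAX --metric-name "$METRIC" \
      --dimensions Name=ClusterId,Value="$DAX_NAME" \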
      --start-time "$START" --end-time "$END" --period 300 \
      --statistics Sum Average --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Sum]' \
      --output text 2>/dev/null || echo "N/A")
    echo "  $METRIC: $DATA"
  done
done

} | ./bedrock-ask.sh "Analyse this ElastiCache and DAX cache data for a production incident.

For ElastiCache Redis: a hit rate below 80% means the cache is not absorbing load and most requests are falling through to the database, producing rising DatabaseConnections on RDS and increased DynamoDB read units simultaneously. Check whether a recent eviction spike caused the hit rate to drop. Check MemoryFragmentationRatio: above 1.5 the OS has allocated significantly more memory than Redis uses for data; above 2.0 is severe and requires activedefrag. Always check EngineCPUUtilization not CPUUtilization; Redis is single-threaded and a cluster can show 25% CPUUtilization while the engine thread is at 90%. ReplicationLag above 10 seconds means replicas are serving stale data. If resharding is in progress (status=modifying), write latency will be elevated on migrating slots; check whether this event correlates with the incident start time.

For DAX: item or query hit rate below 90% means the cluster is adding latency to the request path without reducing DynamoDB load, making it a net bottleneck. ThrottledRequestCount above zero means DAX is rate-limiting requests. ErrorRequestCount and FaultRequestCount above zero indicate client errors and internal DAX errors respectively."
EOF
chmod +x ./diag-cache.sh

10. Security and Compliance Sweep

cat > ./diag-security.sh << 'EOF'
#!/bin/bash
# diag-security.sh
# Usage: ./diag-security.sh
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

{
echo "=== CLOUDTRAIL STATUS ==="
aws cloudtrail describe-trails --region "$REGION" \
  --query 'trailList[*].{Name:Name,S3Bucket:S3BucketName,MultiRegion:IsMultiRegionTrail,LogValidation:LogFileValidationEnabled}' \
  --output json
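# Check every trail rather than assuming one named "default" exists.
for TRAIL_ARN in $(aws cloudtrail describe-trails --region "$REGION" \
    --query 'trailList[].TrailARN' --output text 2>/dev/null); do
  echo "--- Trail status: $TRAIL_ARN ---"
  aws cloudtrail get-trail-status --name "$TRAIL_ARN" --region "$REGION" \
    --query '{IsLogging:IsLogging,LatestDelivery:LatestDeliveryTime,LatestDeliveryError:LatestDeliveryError}' \
    --output json 2>/dev/null || echo "  (status unavailable)"
done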

echo "=== ACM CERTIFICATE STATUS ==="
aws acm list-certificates --region "$REGION" \
  --query 'CertificateSummaryList[*].{ARN:CertificateArn,Domain:DomainName,Status:Status}' --output json

echo "=== WAF COVERAGE ON ALBS ==="
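aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" \
  --query 'WebACLs[*].{Name:Name,ARN:ARN}' --output json
# Listing web ACLs alone does not show coverage, so also list the ALBs each ACL protects.
for ACL_ARN in $(aws wafv2 list-web-acls --scope REGIONAL --region "$REGION" \
    --query 'WebACLs[].ARN' --output text 2>/dev/null); do
  echo "--- ALBs associated with $ACL_ARN ---"
  aws wafv2 list-resources-for-web-acl --web-acl-arn "$ACL_ARN" \
    --resource-type APPLICATION_LOAD_BALANCER --region "$REGION" \
    --query 'ResourceArns' --output json 2>/dev/null || echo "  (associations unavailable)"
done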

echo "=== SSM MANAGED INSTANCE COVERAGE ==="
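# Count only running instances; stopped or terminated ones would show as unmanaged.
ALL=$(aws ec2 describe-instances --region "$REGION" \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[*].Instances[*].InstanceId' --output text)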
SSM=$(aws ssm describe-instance-information --region "$REGION" \
  --query 'InstanceInformationList[*].InstanceId' --output text)
python3 -c "
import sys
all_ids=set(sys.argv[1].split()); ssm_ids=set(sys.argv[2].split())
unmanaged=all_ids-ssm_ids
print(f'EC2: {len(all_ids)} total, {len(ssm_ids)} SSM-managed, {len(unmanaged)} unmanaged')
for i in sorted(unmanaged): print(f'  {i}')
" "$ALL" "$SSM" 2>/dev/null

echo "=== LAMBDA FUNCTION RUNTIMES ==="
aws lambda list-functions --region "$REGION" \
  --query 'Functions[*].{Name:FunctionName,Runtime:Runtime,Memory:MemorySize,Timeout:Timeout,LastModified:LastModified}' \
  --output json

echo "=== CLOUDWATCH ALARMS IN ALARM STATE ==="
aws cloudwatch describe-alarms --state-value ALARM --region "$REGION" \
  --query 'MetricAlarms[*].{Name:AlarmName,Metric:MetricName,Namespace:Namespace,Reason:StateReason}' \
  --output json
echo "Total alarms: $(aws cloudwatch describe-alarms --region "$REGION" --query 'MetricAlarms|length(@)' --output text 2>/dev/null || echo unknown)"

} | ./bedrock-ask.sh "Analyse this security and compliance data for a production AWS account. Identify: CloudTrail trails not logging or without log file validation; ACM certificates expired or expiring within 30 days (expired means immediate HTTPS failures); internet-facing ALBs not protected by WAF; EC2 instances not in SSM (cannot be patched or remotely accessed); Lambda functions on deprecated runtimes approaching end of support; CloudWatch alarms currently in ALARM state (treat each as a potential contributor to the current incident); services with no alarms configured (observability blind spots)."
EOF
chmod +x ./diag-security.sh

11. Auth, Identity, and Certificate Diagnostics

Authentication and authorisation failures are responsible for a disproportionate share of high-severity incidents relative to how rarely they appear in diagnostic checklists. An IAM policy change that removes a permission an application was relying on, an STS session duration that expires during a long-running batch job, a Cognito user pool that starts rejecting tokens because of a client configuration drift, an ACM certificate that expired while the account owner was on leave: all of these produce failures that the application reports as connection errors, permission denied responses, or generic 5xx codes with no obvious link to auth.

Short-lived credentials deserve specific attention. Modern AWS architectures use temporary credentials extensively: EKS pods using IRSA, Lambda execution roles, EC2 instance profiles, ECS task roles. When the rotation mechanism fails because the metadata service is unreachable, the IRSA annotation is misconfigured, or a network policy is blocking the STS endpoint, the application continues using the cached credential until it expires, then fails with an authentication error. The failure is intermittent at first, becoming systematic as the cached credentials age out across all running instances.
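
The progression from intermittent to systematic failure can be sketched with a few lines of arithmetic. This is an illustrative model only: the function name, the four-pod fleet, and the expiry offsets are assumptions, and it presumes rotation has stopped working entirely.

```shell
#!/bin/bash
# Count how many cached credentials have expired at a given moment.
# Arguments: current time in seconds, then one expiry time per instance.
expired_count() {
  local now="$1"; shift
  local expired=0 total=0 exp
  for exp in "$@"; do
    total=$(( total + 1 ))
    if [ "$exp" -le "$now" ]; then expired=$(( expired + 1 )); fi
  done
  echo "$expired/$total"
}

# Hypothetical fleet: four pods whose credentials expire 15 minutes apart,
# measured in seconds from the moment rotation stopped working.
for NOW in 0 1000 2000 4000; do
  echo "t=${NOW}s expired=$(expired_count "$NOW" 900 1800 2700 3600)"
done
```

Because each instance obtained its credential at a different time, the error rate climbs in steps rather than all at once, which is the pattern to look for when AssumeRole or AccessDenied failures ramp up gradually.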

cat > ./diag-auth.sh << 'EOF'
#!/bin/bash
# diag-auth.sh
# Covers ACM certificate expiry, AssumeRole and AccessDenied failures,
# Cognito metrics, short-lived credential risk, and ALB TLS cert cross-reference.
# Usage: ./diag-auth.sh
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"

START=$(python3 -c "from datetime import datetime,timedelta; print((datetime.utcnow()-timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

{
echo "=== ACM CERTIFICATE EXPIRY ==="
CERT_ARNS=$(aws acm list-certificates --region "$REGION" \
  --query 'CertificateSummaryList[*].CertificateArn' --output json | \
  python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin)))" 2>/dev/null)

for CERT_ARN in $CERT_ARNS; do
  aws acm describe-certificate --certificate-arn "$CERT_ARN" --region "$REGION" \
    --query 'Certificate.{Domain:DomainName,Status:Status,NotAfter:NotAfter,RenewalStatus:RenewalSummary.RenewalStatus,InUse:InUseBy,ValidationMethod:DomainValidationOptions[0].ValidationMethod}' \
    --output json 2>/dev/null | python3 -c "
import json,sys,datetime
c=json.load(sys.stdin)
na=c.get('NotAfter','')
if na:
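    try:
        # AWS CLI v2 emits ISO 8601 with a UTC offset and v1 emits epoch seconds;
        # normalise both to timezone-aware values before subtracting.
        if isinstance(na,(int,float)): exp=datetime.datetime.fromtimestamp(na,datetime.timezone.utc)
        else: exp=datetime.datetime.fromisoformat(str(na).replace('Z','+00:00'))
        if exp.tzinfo is None: exp=exp.replace(tzinfo=datetime.timezone.utc)
        days=(exp-datetime.datetime.now(datetime.timezone.utc)).days
        flag='*** EXPIRED' if days<0 else ('*** CRITICAL' if days<7 else ('WARNING' if days<30 else 'OK'))
        print(f\"  {c.get('Domain','')}: {flag} ({days}d) renewal={c.get('RenewalStatus','N/A')} in_use={len(c.get('InUse') or [])}\")
    except Exception as err: print(f\"  {c.get('Domain','')}: could not parse NotAfter={na} ({err})\")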
" 2>/dev/null
done

echo "=== ASSUMEROLE FAILURES (last ${ANALYSIS_HOURS}h) ==="
aws cloudtrail lookup-events \
  --start-time "$START" --end-time "$END" --region "$REGION" \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
  --max-results 50 --output json 2>/dev/null | python3 -c "
import json,sys
data=json.load(sys.stdin); failures=[]
for e in data.get('Events',[]):
    detail=json.loads(e.get('CloudTrailEvent','{}'))
    error=detail.get('errorCode','')
    if error:
        failures.append({'time':e.get('EventTime',''),'error':error,
                         'role':detail.get('requestParameters',{}).get('roleArn','')})
print(f'AssumeRole failures: {len(failures)}')
for f in failures[:20]: print(f\"  {f['time']} | {f['error']:30s} | {f['role']}\")
" 2>/dev/null

echo "=== ACCESS DENIED EVENTS (last ${ANALYSIS_HOURS}h) ==="
# lookup-events cannot filter by error code (its lookup keys are limited to fields
# such as EventName, EventSource, Username and ResourceName), so fetch recent
# events and filter on errorCode client-side. --max-items caps the total across pages.
aws cloudtrail lookup-events \
  --start-time "$START" --end-time "$END" --region "$REGION" \
  --max-items 1000 --output json 2>/dev/null | python3 -c "
import json,sys
from collections import Counter
data=json.load(sys.stdin); scanned=data.get('Events',[]); events=[]
by_action=Counter(); by_principal=Counter()
for e in scanned:
    detail=json.loads(e.get('CloudTrailEvent','{}'))
    if 'AccessDenied' not in (detail.get('errorCode') or ''): continue
    events.append(e)
    action=detail.get('eventSource','').replace('.amazonaws.com','') + ':' + detail.get('eventName','')
    principal=detail.get('userIdentity',{}).get('arn','unknown')
    by_action[action]+=1; by_principal[principal]+=1
print(f'Total AccessDenied: {len(events)} (in the {len(scanned)} most recent events scanned)')
print('Top denied actions:')
for a,n in by_action.most_common(10): print(f'  {a}: {n}')
print('Top denied principals:')
for p,n in by_principal.most_common(5): print(f'  {p}: {n}')
" 2>/dev/null

echo "=== COGNITO USER POOLS ==="
POOLS=$(aws cognito-idp list-user-pools --max-results 20 --region "$REGION" \
  --query 'UserPools[*].{ID:Id,Name:Name,Status:Status}' --output json 2>/dev/null) || POOLS="[]"
echo "$POOLS"
POOL_IDS=$(echo "$POOLS" | python3 -c "
import json,sys
try: print('\n'.join(p['ID'] for p in json.load(sys.stdin)))
except: pass" 2>/dev/null) || POOL_IDS=""

for POOL_ID in $POOL_IDS; do
  echo "--- Pool $POOL_ID ---"
  for METRIC in SignInSuccesses SignInThrottles TokenRefreshSuccesses TokenRefreshThrottles; do
    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/Cognito --metric-name "$METRIC" \
      --dimensions Name=UserPool,Value="$POOL_ID" \
      --start-time "$START" --end-time "$END" --period 300 --statistics Sum \
      --region "$REGION" --query 'sort_by(Datapoints,&Timestamp)[-1].Sum' \
      --output text 2>/dev/null) || DATA="N/A"
    echo "  $METRIC: $DATA"
  done
done

echo "=== SHORT-LIVED CREDENTIAL EXPIRY RISK ==="
aws iam list-roles \
  --query 'Roles[*].{Name:RoleName,MaxSession:MaxSessionDuration}' --output json 2>/dev/null | python3 -c "
import json,sys
roles=json.load(sys.stdin)
short=[r for r in roles if r.get('MaxSession',3600)<=3600]
print(f'Roles with max session <=1h: {len(short)}')
for r in short[:10]: print(f\"  {r['Name']}: {r['MaxSession']}s ({r['MaxSession']//60}min)\")
" 2>/dev/null

echo "=== TLS CERTIFICATE ARNs ON ALB LISTENERS ==="
LB_ARNS=$(aws elbv2 describe-load-balancers --region "$REGION" \
  --query 'LoadBalancers[].LoadBalancerArn' --output text 2>/dev/null | tr '\t' '\n')
for LB_ARN in $LB_ARNS; do
  LB_NAME=$(echo "$LB_ARN" | awk -F/ '{print $(NF-1)}')
  aws elbv2 describe-listeners --load-balancer-arn "$LB_ARN" --region "$REGION" \
    --query 'Listeners[?Protocol==`HTTPS`||Protocol==`TLS`].{Port:Port,Protocol:Protocol,Certs:Certificates[*].CertificateArn}' \
    --output json 2>/dev/null | python3 -c "
import json,sys
for l in json.load(sys.stdin):
    print(f\"  {sys.argv[1]}:{l['Port']} ({l['Protocol']}) certs={l.get('Certs',[])}\")
" "$LB_NAME" 2>/dev/null
done

} | ./bedrock-ask.sh "Analyse this authentication, identity, and certificate data for a production incident.

ACM certificates: flag expired (immediate TLS failures), expiring within 7 days (critical; auto-renewal may have silently failed), or expiring within 30 days with a RenewalStatus other than SUCCESS (PENDING_VALIDATION or FAILED this close to expiry means renewal is stuck). ACM auto-renewal fails when DNS validation records have been removed or the certificate is no longer in use by an AWS service such as a load balancer.

IAM failures: a cluster of AssumeRole failures on a specific role ARN means a service lost the ability to assume that role, usually because the trust policy was modified or the role was deleted. AccessDenied clusters on a specific action across multiple principals indicate an SCP or resource policy change removed a widely-relied-upon permission. A sudden increase correlating with the incident start time is strong evidence that a policy change caused the incident.

Short-lived credentials: roles with MaxSessionDuration of 3600 seconds are at risk during operations longer than one hour. When the credential expires mid-operation the application receives ExpiredTokenException. Flag any role used by batch workloads or long-running ECS/EKS tasks with session duration below 4 hours.

Cognito: SignInThrottles above zero means the user pool is rate-limiting auth. A sudden drop in SignInSuccesses without a corresponding rise in throttles suggests the pool is rejecting requests, possibly from a Lambda trigger failure or client configuration change.

TLS cross-reference: match the certificate ARNs on HTTPS and TLS listeners against the expired or near-expiry certs in the ACM section. A load balancer serving an expired certificate produces connection errors that look like network failures rather than certificate failures."
EOF
chmod +x ./diag-auth.sh
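
The TLS cross-reference that the prompt asks Bedrock to perform can also be computed locally, so the match arrives as evidence rather than inference. This is a minimal sketch, assuming the ACM and listener sections have already been parsed into the hypothetical shapes shown in the docstring. `listeners_at_risk` and its arguments are illustrative names that do not appear in diag-auth.sh.

```python
def listeners_at_risk(cert_days, listeners, threshold_days=30):
    """Return listeners whose certificates are expired or close to expiry.

    cert_days: dict mapping certificate ARN -> days until expiry (negative = expired).
    listeners: list of dicts like {'lb': 'web', 'port': 443, 'certs': [arn, ...]}.
    Both shapes are assumptions made for this sketch.
    """
    findings = []
    for listener in listeners:
        for arn in listener.get("certs", []):
            days = cert_days.get(arn)
            if days is None:
                # Not present in the ACM listing (for example an IAM server
                # certificate), so report UNKNOWN instead of guessing
                status = "UNKNOWN"
            elif days < 0:
                status = "EXPIRED"
            elif days < threshold_days:
                status = f"EXPIRES_IN_{days}D"
            else:
                continue
            findings.append((listener["lb"], listener["port"], arn, status))
    return findings

if __name__ == "__main__":
    # Placeholder ARNs, not real resources
    certs = {"arn:example:cert/a": -3, "arn:example:cert/b": 200}
    lbs = [{"lb": "web", "port": 443, "certs": ["arn:example:cert/a", "arn:example:cert/b"]}]
    for finding in listeners_at_risk(certs, lbs):
        print(finding)
```

Reporting `UNKNOWN` for certificates missing from the ACM listing keeps the output consistent with the guide's rule that absent evidence is stated rather than filled in.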

12. CI/CD and Release Causality

The change timeline from CloudTrail tells you what infrastructure changed. CI/CD causality goes one layer deeper: it asks not just what infrastructure changed but what deployment caused the change, what code it contained, and whether rolling back to the previous version would restore service. This distinction matters because a CloudTrail event showing UpdateService on an ECS service tells you that a deployment happened but not what it changed or whether it is the proximate cause of the current failure.

cat > ./diag-cicd.sh << 'EOF'
#!/bin/bash
# diag-cicd.sh
# Usage: ./diag-cicd.sh
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"

START=$(python3 -c "from datetime import datetime,timedelta; print((datetime.utcnow()-timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

{
echo "=== CODEPIPELINE EXECUTIONS (last ${ANALYSIS_HOURS}h) ==="
PIPELINE_NAMES=$(aws codepipeline list-pipelines --region "$REGION" \
  --query 'pipelines[*].name' --output json | \
  python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin)))" 2>/dev/null) || PIPELINE_NAMES=""

for PIPELINE in $PIPELINE_NAMES; do
  echo "--- Pipeline: $PIPELINE ---"
  aws codepipeline list-pipeline-executions \
    --pipeline-name "$PIPELINE" --region "$REGION" \
    --query 'pipelineExecutionSummaries[:10].{ID:pipelineExecutionId,Status:status,StartTime:startTime,StopTime:lastUpdateTime,Trigger:trigger.triggerType}' \
    --output json 2>/dev/null | python3 -c "
import json,sys
for e in json.load(sys.stdin):
    s=e.get('Status','')
    m='*** FAILED' if s=='Failed' else ('*** SUPERSEDED' if s=='Superseded' else '')
    print(f\"  {e.get('StartTime','')} | {s:15s} {m} | {e.get('ID','')[:12]} | trigger={e.get('Trigger','')}\")
" 2>/dev/null
done

echo "=== CODEDEPLOY DEPLOYMENTS (last ${ANALYSIS_HOURS}h) ==="
APP_NAMES=$(aws deploy list-applications --region "$REGION" \
  --query 'applications' --output json | \
  python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin)))" 2>/dev/null) || APP_NAMES=""

for APP in $APP_NAMES; do
  # GROUPS is a reserved bash variable (assignments to it are ignored), so use another name.
  DEPLOY_GROUPS=$(aws deploy list-deployment-groups \
    --application-name "$APP" --region "$REGION" \
    --query 'deploymentGroups' --output json | \
    python3 -c "import json,sys; print('\n'.join(json.load(sys.stdin)))" 2>/dev/null) || DEPLOY_GROUPS=""
  for GROUP in $DEPLOY_GROUPS; do
    echo "--- $APP / $GROUP ---"
    # The time range shorthand keys are lowercase: start=...,end=...
    aws deploy list-deployments \
      --application-name "$APP" --deployment-group-name "$GROUP" \
      --create-time-range start="$START",end="$END" \
      --region "$REGION" --query 'deployments' --output json 2>/dev/null | python3 -c "
import json,sys; d=json.load(sys.stdin)
print(f'  Deployments in window: {len(d)}')
for x in d[:5]: print(f'  {x}')
" 2>/dev/null
  done
done

echo "=== CLOUDFORMATION STACKS CHANGED (last ${ANALYSIS_HOURS}h) ==="
STACKS=$(aws cloudformation list-stacks --region "$REGION" \
  --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE UPDATE_ROLLBACK_COMPLETE \
                        ROLLBACK_COMPLETE UPDATE_ROLLBACK_FAILED \
  --query 'StackSummaries[*].{Name:StackName,Status:StackStatus,Updated:LastUpdatedTime,Created:CreationTime}' \
  --output json 2>/dev/null) || STACKS="[]"
echo "$STACKS" | python3 -c "
import json,sys
from datetime import datetime,timedelta,timezone
cutoff=datetime.now(timezone.utc)-timedelta(hours=${ANALYSIS_HOURS})
stacks=json.load(sys.stdin)
# LastUpdatedTime is null for stacks that were created but never updated,
# so fall back to the creation time before comparing.
def changed(s): return s.get('Updated') or s.get('Created') or ''
recent=[s for s in stacks if changed(s)>cutoff.isoformat()]
print(f'Stacks created or updated in last ${ANALYSIS_HOURS}h: {len(recent)}')
for s in recent[:20]:
    print(f\"  {changed(s)} | {s.get('Status',''):35s} | {s.get('Name','')}\")
" 2>/dev/null

echo "=== ECS SERVICES WITH MULTIPLE ACTIVE DEPLOYMENTS ==="
CLUSTERS=$(aws ecs list-clusters --region "$REGION" \
  --query 'clusterArns' --output text | tr '\t' '\n')
for CLUSTER in $CLUSTERS; do
  SERVICES=$(aws ecs list-services --cluster "$CLUSTER" --region "$REGION" \
    --query 'serviceArns' --output text 2>/dev/null | tr '\t' '\n') || continue
  for SERVICE in $SERVICES; do
    aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" --region "$REGION" \
      --query 'services[0].{Name:serviceName,Running:runningCount,Desired:desiredCount,Pending:pendingCount,Deployments:deployments[*].{Status:status,Running:runningCount,Desired:desiredCount,CreatedAt:createdAt}}' \
      --output json 2>/dev/null | python3 -c "
import json,sys
s=json.load(sys.stdin)
if len(s.get('Deployments',[]))>1:
    print(f\"  {s.get('Name','')}: running={s.get('Running',0)} desired={s.get('Desired',0)} *** MULTIPLE DEPLOYMENTS ACTIVE\")
    for d in s.get('Deployments',[]):
        print(f\"    {d.get('Status',''):15s} running={d.get('Running',0)} desired={d.get('Desired',0)} at {d.get('CreatedAt','')}\")
" 2>/dev/null
  done
done

} | ./bedrock-ask.sh "Analyse this CI/CD and deployment data for a production incident. Identify: CodePipeline executions that failed or were superseded and whether their timing correlates with the incident start; CodeDeploy deployments that ran during the incident window; CloudFormation stacks in UPDATE_ROLLBACK_FAILED status (a failed update that could not roll back cleanly); CloudFormation stacks updated shortly before the incident; ECS services with multiple concurrent active deployments (a stuck deployment where the new task definition is not reaching healthy status, causing old and new tasks to serve traffic simultaneously); and the delta between the most recent successful deployment and the first signs of the incident. A deployment that completed 5-15 minutes before the incident started is a strong candidate for the cause even if the pipeline showed success."
EOF
chmod +x ./diag-cicd.sh
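
The timing heuristic in the prompt, where a deployment finishing shortly before the incident is a strong candidate, can be pre-computed so that Bedrock receives explicit gaps instead of raw timestamps. The sketch below assumes deployment completion times have already been extracted as datetimes. The function name and tuple layout are hypothetical, and the 15 minute window is taken from the upper bound in the prompt.

```python
from datetime import datetime, timedelta

def rank_deployments(deployments, incident_start, window=timedelta(minutes=15)):
    """Order deployments that completed before the incident by closeness to its start.

    deployments: list of (name, completed_at) tuples; all datetimes must be
    consistently naive or consistently timezone-aware.
    Returns (name, minutes_before_incident, in_window) tuples, nearest first.
    """
    ranked = []
    for name, completed_at in deployments:
        gap = incident_start - completed_at
        if gap < timedelta(0):
            # Completed after the incident began. A rollout still in progress at
            # incident start needs separate handling, which this sketch omits.
            continue
        minutes = round(gap.total_seconds() / 60, 1)
        ranked.append((name, minutes, gap <= window))
    return sorted(ranked, key=lambda item: item[1])

if __name__ == "__main__":
    incident = datetime(2026, 6, 13, 2, 0)
    history = [("web", datetime(2026, 6, 13, 1, 0)), ("api", datetime(2026, 6, 13, 1, 50))]
    for row in rank_deployments(history, incident):
        print(row)
```

Proximity is only a ranking signal. A deployment inside the window is a candidate to examine first, not a confirmed cause.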

13. Appendix A: Bedrock Quota Management

Before running the diagnostic scripts on a live incident, check Bedrock service quotas. The default limits in most regions are aggressively low for the workload this guide generates.

cat > ./check-bedrock-quotas.sh << 'EOF'
#!/bin/bash
# check-bedrock-quotas.sh
# Usage: ./check-bedrock-quotas.sh
set -euo pipefail
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

echo "=== BEDROCK SERVICE QUOTAS: $REGION ==="
aws service-quotas list-service-quotas \
  --service-code bedrock --region "$REGION" \
  --query 'Quotas[*].{Name:QuotaName,Value:Value,Unit:Unit}' \
  --output json 2>/dev/null | python3 -c "
import json,sys
for q in json.load(sys.stdin):
    print(f\"  {q.get('Value',0):8.0f} {q.get('Unit',''):10s} {q.get('Name','')}\")
" 2>/dev/null

echo ""
echo "Request a quota increase:"
echo "  https://console.aws.amazon.com/servicequotas/home/services/bedrock/quotas"
echo ""
echo "For same-day capacity without waiting for approval, use Provisioned Throughput:"
echo "  aws bedrock create-provisioned-model-throughput \\"
echo "    --model-id anthropic.claude-3-5-sonnet-20241022-v2:0 \\"
echo "    --provisioned-model-name prod-diagnostics \\"
echo "    --model-units 1 \\"
echo "    --region $REGION"
EOF
chmod +x ./check-bedrock-quotas.sh

If bedrock-ask.sh receives a ThrottlingException during an incident, it retries up to 5 times with exponential backoff starting at 5 seconds. If throttling persists after all retries, the investigation can continue by saving evidence to disk and replaying once quota recovers:

# Collect evidence without Bedrock analysis
mkdir -p evidence
./diag-network.sh > evidence/network-$(date +%s).txt 2>&1

# Once quota recovers, replay
cat evidence/network-*.txt | ./bedrock-ask.sh "Analyse this network evidence for a live production incident."
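
To judge how long a throttled call can stall before falling back to save-and-replay, it helps to total the retry schedule. The sketch below assumes the delay doubles on each attempt. The text above only states exponential backoff starting at 5 seconds, so the factor of 2 and the function name are assumptions.

```python
def backoff_delays(retries=5, first_delay=5.0, factor=2.0):
    """Return the wait in seconds before each retry attempt."""
    return [first_delay * factor ** attempt for attempt in range(retries)]

if __name__ == "__main__":
    delays = backoff_delays()
    print(delays)       # [5.0, 10.0, 20.0, 40.0, 80.0]
    print(sum(delays))  # 155.0
```

Under that assumption a single throttled request can hold a script for about two and a half minutes before giving up, which is worth knowing when several diagnostic scripts run in sequence.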

14. Appendix B: Baseline Snapshot

Capturing a healthy-state baseline makes every subsequent incident investigation materially better. The baseline gives Bedrock a normal reference for each metric and allows it to report deltas and anomalies rather than absolute values that require human calibration.

Run this during a known-good period. Pass the output directory to any diagnostic script via BASELINE_DIR.

cat > ./baseline-snapshot.sh << 'EOF'
#!/bin/bash
# baseline-snapshot.sh
# Run during a known-good period. Pass the output directory via BASELINE_DIR to
# any diagnostic script during an incident for delta-based anomaly analysis.
# Usage: ./baseline-snapshot.sh [output-dir]
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
OUTPUT_DIR="${1:-./baselines/$(date +%Y%m%d-%H%M)}"
mkdir -p "$OUTPUT_DIR"

echo "Capturing baseline snapshot to $OUTPUT_DIR"
echo "Timestamp: $(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$OUTPUT_DIR/metadata.txt"
echo "Region: $REGION" >> "$OUTPUT_DIR/metadata.txt"

SEVEN_DAYS_AGO=$(date -u -d '7 days ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
  || date -u -v-7d '+%Y-%m-%dT%H:%M:%SZ')
NOW=$(date -u '+%Y-%m-%dT%H:%M:%SZ')

echo "Capturing RDS baselines..."
for INSTANCE in $(aws rds describe-db-instances --region "$REGION" \
    --query 'DBInstances[].DBInstanceIdentifier' --output text | tr '\t' '\n'); do
  for METRIC in CPUUtilization FreeableMemory DatabaseConnections ReadLatency WriteLatency; do
    aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS --metric-name "$METRIC" \
      --dimensions Name=DBInstanceIdentifier,Value="$INSTANCE" \
      --start-time "$SEVEN_DAYS_AGO" --end-time "$NOW" \
      --period 3600 --statistics Average Maximum Minimum \
      --region "$REGION" --output json 2>/dev/null \
      >> "$OUTPUT_DIR/rds-${INSTANCE}-${METRIC}.json"
  done
done

echo "Capturing ElastiCache baselines..."
for CLUSTER in $(aws elasticache describe-cache-clusters --region "$REGION" \
    --query 'CacheClusters[].CacheClusterId' --output text 2>/dev/null | tr '\t' '\n'); do
  for METRIC in CacheHitRate Evictions EngineCPUUtilization FreeableMemory; do
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache --metric-name "$METRIC" \
      --dimensions Name=CacheClusterId,Value="$CLUSTER" \
      --start-time "$SEVEN_DAYS_AGO" --end-time "$NOW" \
      --period 3600 --statistics Average Maximum \
      --region "$REGION" --output json 2>/dev/null \
      >> "$OUTPUT_DIR/elasticache-${CLUSTER}-${METRIC}.json"
  done
done

echo "Capturing ALB baselines..."
for LB_ARN in $(aws elbv2 describe-load-balancers --region "$REGION" \
    --query 'LoadBalancers[].LoadBalancerArn' --output text | tr '\t' '\n'); do
  LB_SUFFIX=$(echo "$LB_ARN" | awk -F':loadbalancer/' '{print $2}')
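  # Use the load balancer name (second-to-last ARN segment) rather than the trailing ID
  LB_NAME=$(echo "$LB_ARN" | awk -F/ '{print $(NF-1)}')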
  for METRIC in UnHealthyHostCount TargetResponseTime HTTPCode_Target_5XX_Count; do
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ApplicationELB --metric-name "$METRIC" \
      --dimensions Name=LoadBalancer,Value="$LB_SUFFIX" \
      --start-time "$SEVEN_DAYS_AGO" --end-time "$NOW" \
      --period 3600 --statistics Average Maximum \
      --region "$REGION" --output json 2>/dev/null \
      >> "$OUTPUT_DIR/alb-${LB_NAME}-${METRIC}.json"
  done
done

echo ""
echo "Baseline captured in $OUTPUT_DIR"
echo "Use during incidents:"
echo "  BASELINE_DIR=$OUTPUT_DIR ./diag-network.sh"
echo "  BASELINE_DIR=$OUTPUT_DIR ./diag-kubernetes.sh"
echo "  BASELINE_DIR=$OUTPUT_DIR ./diag-rds.sh <instance-id>"
EOF
chmod +x ./baseline-snapshot.sh
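
How a diagnostic script consumes BASELINE_DIR is not shown in this section, so the following is only a sketch of the delta idea. It assumes a baseline file holds one raw `get-metric-statistics` response with a `Datapoints` list, as written by the script above. The function name and returned keys are hypothetical.

```python
import json
import statistics

def baseline_delta(baseline_json, current_value):
    """Compare a current metric value against the hourly averages in a baseline file.

    baseline_json: text of a get-metric-statistics response containing 'Datapoints'.
    Returns None when the baseline has no usable datapoints.
    """
    points = [p["Average"] for p in json.loads(baseline_json).get("Datapoints", [])
              if p.get("Average") is not None]
    if not points:
        return None
    mean = statistics.fmean(points)
    spread = statistics.pstdev(points)
    return {
        "baseline_mean": mean,
        "baseline_max": max(points),
        "delta": current_value - mean,
        # Number of standard deviations from normal; None if the baseline is flat
        "sigma": (current_value - mean) / spread if spread else None,
    }

if __name__ == "__main__":
    sample = json.dumps({"Datapoints": [{"Average": 10}, {"Average": 20}, {"Average": 30}]})
    print(baseline_delta(sample, 40))
```

Passing the delta and sigma alongside the raw value gives the model a stated reference point, which is the purpose of capturing a baseline in the first place.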

15. Disk and Storage Diagnostics

Storage failures have two distinct failure modes that are easy to conflate. The first is capacity exhaustion: the volume or filesystem is full and writes start failing. The second is performance exhaustion: the volume has space remaining but cannot service I/O fast enough, causing requests to queue. Both manifest in applications as write errors or latency spikes, but the remediation is completely different.

For RDS, gp2 volumes below 334GB have a baseline below 1,002 IOPS (3 IOPS per GB) and a burst bucket that drains under sustained load. Once the burst bucket is empty the volume drops to its baseline rate and stays there until the bucket refills, which happens at the baseline rate of 3 credits per GB per second. A 100GB gp2 volume earns 300 credits per second and spends up to 3,000 per second while bursting; at full burst the 5.4 million credit bucket is empty in roughly half an hour, and sooner if it was already partly drained. This produces latency spikes that look like query performance regressions and are frequently misattributed to missing indexes or lock contention.
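
The burst arithmetic can be worked through directly. The sketch below uses the gp2 figures AWS documents (3 IOPS per GB baseline with a floor of 100 and a cap of 16,000, a 3,000 IOPS burst ceiling, and a 5.4 million credit bucket). The function name and return shape are illustrative, and the model ignores any credits already spent.

```python
GP2_BUCKET_CREDITS = 5_400_000  # maximum I/O credit balance of a gp2 volume
GP2_BURST_IOPS = 3000           # burst ceiling for volumes whose baseline is below it

def gp2_burst_profile(size_gb, demand_iops=GP2_BURST_IOPS):
    """Estimate baseline IOPS and burst bucket drain and refill times for a gp2 volume."""
    baseline = min(16000, max(100, 3 * size_gb))
    demand = min(demand_iops, GP2_BURST_IOPS)
    net_drain = demand - baseline  # credits spent per second beyond those earned
    drain_seconds = GP2_BUCKET_CREDITS / net_drain if net_drain > 0 else None
    refill_seconds = GP2_BUCKET_CREDITS / baseline  # from empty, with the volume idle
    return {"baseline_iops": baseline,
            "drain_seconds": drain_seconds,
            "refill_seconds": refill_seconds}

if __name__ == "__main__":
    profile = gp2_burst_profile(100)
    print(profile["baseline_iops"])        # 300
    print(profile["drain_seconds"] / 60)   # about 33 minutes at full burst
    print(profile["refill_seconds"] / 3600)  # 5.0 hours to refill when idle
```

The asymmetry is the important part: a 100GB volume empties its bucket in about half an hour of full burst and needs five idle hours to recover, so a workload that bursts daily may never regain a full bucket.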

For EC2, the instance type defines a hard ceiling on aggregate EBS IOPS and throughput that applies across all attached volumes combined. Provisioning volumes with more IOPS than the instance type can deliver means the excess provisioned capacity is invisible at the instance level; the throttling happens silently at the hypervisor layer and produces confusing latency behaviour under load. The mismatch analysis section of this script cross-references each instance’s provisioned volume IOPS and throughput against the instance type’s documented EBS limits to surface these mismatches explicitly.

cat > ./diag-disk.sh << 'EOF'
#!/bin/bash
# diag-disk.sh
# Covers RDS storage health and EC2/EBS disk diagnostics in a single pass.
# Detects: storage exhaustion, IOPS saturation, burst credit depletion,
# throughput ceiling, EBS type/size mismatches vs instance limits,
# RDS storage autoscaling state, gp2 burst bucket drain, and OS-level
# disk usage via CloudWatch agent metrics where available.
# Usage: ./diag-disk.sh
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"

START=$(python3 -c "from datetime import datetime,timedelta; print((datetime.utcnow()-timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

{

# ── RDS STORAGE ────────────────────────────────────────────────────────────

echo "=== RDS INSTANCE STORAGE CONFIGURATION ==="
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[].{
    ID:DBInstanceIdentifier,
    Class:DBInstanceClass,
    Engine:Engine,
    StorageGB:AllocatedStorage,
    MaxStorageGB:MaxAllocatedStorage,
    StorageType:StorageType,
    ProvisionedIOPS:Iops,
    StorageThroughput:StorageThroughput,
    AutoscalingEnabled:MaxAllocatedStorage,
    MultiAZ:MultiAZ,
    Status:DBInstanceStatus}' \
  --output json

echo "=== RDS STORAGE CLOUDWATCH METRICS ==="
INSTANCES=$(aws rds describe-db-instances --region "$REGION" \
  --query 'DBInstances[].DBInstanceIdentifier' --output text)

for INSTANCE in $INSTANCES; do
  echo "--- RDS instance: $INSTANCE ---"

  # FreeStorageSpace: alert if trending toward zero
  for METRIC in FreeStorageSpace ReadIOPS WriteIOPS ReadThroughput WriteThroughput \
                DiskQueueDepth ReadLatency WriteLatency BurstBalance; do
    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS --metric-name "$METRIC" \
      --dimensions Name=DBInstanceIdentifier,Value="$INSTANCE" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Minimum Maximum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-6:].[Timestamp,Average,Minimum,Maximum]' \
      --output json 2>/dev/null | python3 -c "
import json,sys
d=json.load(sys.stdin)
if not d: print('  no data')
else:
    last=d[-1]
    trend='falling' if len(d)>1 and (last[1] or 0)<(d[0][1] or 0) else 'stable/rising'
    print(f'  latest: avg={last[1]} min={last[2]} max={last[3]} at {last[0]} | trend={trend}')
" 2>/dev/null || echo '  no data')
    echo "  $METRIC:"
    echo "$DATA"
  done

  # Provisioned IOPS vs actual: detect over-provisioned or saturated instances
  IOPS_PROV=$(aws rds describe-db-instances \
    --db-instance-identifier "$INSTANCE" --region "$REGION" \
    --query 'DBInstances[0].Iops' --output text 2>/dev/null || echo 'None')
  STORAGE_TYPE=$(aws rds describe-db-instances \
    --db-instance-identifier "$INSTANCE" --region "$REGION" \
    --query 'DBInstances[0].StorageType' --output text 2>/dev/null || echo 'unknown')
  STORAGE_GB=$(aws rds describe-db-instances \
    --db-instance-identifier "$INSTANCE" --region "$REGION" \
    --query 'DBInstances[0].AllocatedStorage' --output text 2>/dev/null || echo '0')
  echo "  StorageType=$STORAGE_TYPE ProvisionedIOPS=$IOPS_PROV AllocatedGB=$STORAGE_GB"

  # gp2 burst: baseline is 3 IOPS/GB, min 100, burst to 3000 for volumes <1TB
  # BurstBalance=0 means the bucket is empty and you are capped at baseline
  python3 -c "
import sys
try:
    gb=int(sys.argv[1]); stype=sys.argv[2]
    if stype=='gp2':
        baseline=max(100, gb*3)
        burst=3000 if gb<1000 else baseline
        print(f'  gp2 IOPS: baseline={baseline} burst_cap={burst}')
        if gb<334:
            print(f'  WARNING: volume <334GB on gp2 means baseline IOPS < 1002; consider gp3 for predictable performance')
    elif stype=='gp3':
        print(f'  gp3: baseline 3000 IOPS / 125 MiB/s included; additional IOPS provisioned separately')
    elif stype=='io1' or stype=='io2':
        print(f'  Provisioned IOPS storage: IOPS fixed at provisioned value, no burst')
except: pass
" "$STORAGE_GB" "$STORAGE_TYPE" 2>/dev/null

done

# ── EC2 / EBS ──────────────────────────────────────────────────────────────

echo "=== EC2 INSTANCES AND ATTACHED EBS VOLUMES ==="
aws ec2 describe-instances \
  --region "$REGION" \
  --query 'Reservations[*].Instances[*].{
    InstanceId:InstanceId,
    InstanceType:InstanceType,
    State:State.Name,
    Volumes:BlockDeviceMappings[*].{
      Device:DeviceName,
      VolumeId:Ebs.VolumeId,
      DeleteOnTermination:Ebs.DeleteOnTermination}}' \
  --output json

echo "=== EBS VOLUME DETAILS ==="
aws ec2 describe-volumes \
  --region "$REGION" \
  --query 'Volumes[*].{
    VolumeId:VolumeId,
    Type:VolumeType,
    SizeGB:Size,
    IOPS:Iops,
    Throughput:Throughput,
    State:State,
    MultiAttach:MultiAttachEnabled,
    Encrypted:Encrypted,
    AttachedTo:Attachments[0].InstanceId,
    Device:Attachments[0].Device}' \
  --output json

echo "=== EC2 INSTANCE TYPE EBS LIMITS (key types) ==="
# Pull instance type data to detect IOPS/throughput mismatches
INSTANCE_TYPES=$(aws ec2 describe-instances \
  --region "$REGION" \
  --query 'Reservations[*].Instances[?State.Name==`running`].InstanceType' \
  --output text | tr '\t' '\n' | sort -u)

for ITYPE in $INSTANCE_TYPES; do
  aws ec2 describe-instance-types \
    --instance-types "$ITYPE" \
    --region "$REGION" \
    --query 'InstanceTypes[0].{
      Type:InstanceType,
      EbsMaxBandwidthMbps:EbsInfo.EbsOptimizedInfo.MaximumBandwidthInMbps,
      EbsBaselineBandwidthMbps:EbsInfo.EbsOptimizedInfo.BaselineBandwidthInMbps,
      EbsMaxIOPS:EbsInfo.EbsOptimizedInfo.MaximumIops,
      EbsBaselineIOPS:EbsInfo.EbsOptimizedInfo.BaselineIops,
      EbsMaxThroughputMBps:EbsInfo.EbsOptimizedInfo.MaximumThroughputInMBps,
      EbsBaselineThroughputMBps:EbsInfo.EbsOptimizedInfo.BaselineThroughputInMBps,
      EbsOptimizedSupport:EbsInfo.EbsOptimizedSupport,
      NvmeSupport:EbsInfo.NvmeSupport}' \
    --output json 2>/dev/null || echo "  (no EBS limit data for $ITYPE)"
done

echo "=== IOPS/THROUGHPUT MISMATCH ANALYSIS ==="
# Cross-reference each attached volume's provisioned IOPS/throughput
# against the instance type's EBS limit ceiling
python3 - << 'PYEOF'
import subprocess, json, sys, os

region = os.environ.get('AWS_DEFAULT_REGION', 'ap-southeast-1')

def aws(args):
    r = subprocess.run(['aws'] + args + ['--region', region, '--output', 'json'],
                       capture_output=True, text=True,
                       env={**os.environ, 'AWS_PROFILE': 'prod-diagnostics'})
    try: return json.loads(r.stdout)
    except: return {}

instances = []
for res in aws(['ec2','describe-instances',
               '--query','Reservations[*].Instances[?State.Name==`running`]']):
    for inst in (res if isinstance(res, list) else []):
        instances.append(inst)

if not instances:
    print('  No running instances found or API call failed.')
    sys.exit(0)

volumes_raw = aws(['ec2','describe-volumes',
                   '--query','Volumes[*]'])
vol_map = {v['VolumeId']: v for v in (volumes_raw if isinstance(volumes_raw,list) else [])}

itype_cache = {}
def get_itype(itype):
    if itype in itype_cache: return itype_cache[itype]
    d = aws(['ec2','describe-instance-types','--instance-types',itype,
             '--query','InstanceTypes[0].EbsInfo.EbsOptimizedInfo'])
    itype_cache[itype] = d if isinstance(d,dict) else {}
    return itype_cache[itype]

issues = []
for inst in instances:
    iid   = inst.get('InstanceId','')
    itype = inst.get('InstanceType','')
    limits = get_itype(itype)
    max_iops = limits.get('MaximumIops') or 0
    base_iops = limits.get('BaselineIops') or 0
    # Read the MB/s limits so they compare like-for-like with volume throughput.
    # The *BandwidthInMbps fields are megabits per second (8x larger numerically),
    # which would hide real throughput mismatches.
    max_tp   = limits.get('MaximumThroughputInMBps') or 0
    base_tp  = limits.get('BaselineThroughputInMBps') or 0

    total_prov_iops = 0
    total_prov_tp   = 0
    vol_details = []
    for bdm in inst.get('BlockDeviceMappings',[]):
        vid = (bdm.get('Ebs') or {}).get('VolumeId','')
        vol = vol_map.get(vid, {})
        vtype = vol.get('VolumeType','')
        size  = vol.get('Size', 0)
        viops = vol.get('Iops') or 0
        vtp   = vol.get('Throughput') or 0
        # gp2: IOPS = max(100, size*3), capped at 16000
        if vtype == 'gp2':
            viops = min(16000, max(100, size * 3))
            vtp   = min(250, size * 0.25)   # rough MiB/s estimate
        # gp3 baseline if not provisioned above baseline
        elif vtype == 'gp3':
            viops = max(3000, viops)
            vtp   = max(125, vtp)
        total_prov_iops += viops
        total_prov_tp   += vtp
        vol_details.append(f'{vid}({vtype},{size}GB,{viops}iops,{vtp}MBps)')

    print(f'\n  {iid} [{itype}]')
    print(f'    Instance EBS limits : baseline={base_iops}iops/{base_tp}MBps  max={max_iops}iops/{max_tp}MBps')
    print(f'    Total provisioned   : {total_prov_iops}iops / {total_prov_tp:.0f}MBps across {len(vol_details)} volumes')
    print(f'    Volumes             : {", ".join(vol_details)}')

    if max_iops and total_prov_iops > max_iops:
        pct = total_prov_iops / max_iops * 100
        issues.append(f'MISMATCH {iid}: provisioned {total_prov_iops} IOPS exceeds instance max {max_iops} ({pct:.0f}%)')
        print(f'    *** IOPS MISMATCH: volumes provision {total_prov_iops} IOPS but instance ceiling is {max_iops}')
        print(f'    *** Excess IOPS are silently capped at the instance level; throughput appears throttled')
    elif base_iops and total_prov_iops > base_iops:
        print(f'    NOTE: provisioned IOPS exceed baseline; burst capacity will be consumed under sustained load')

    if max_tp and total_prov_tp > max_tp:
        issues.append(f'MISMATCH {iid}: provisioned {total_prov_tp:.0f} MBps throughput exceeds instance max {max_tp} MBps')
        print(f'    *** THROUGHPUT MISMATCH: volumes provision {total_prov_tp:.0f} MBps but instance ceiling is {max_tp} MBps')

if issues:
    print('\n=== MISMATCH SUMMARY ===')
    for i in issues: print(f'  {i}')
else:
    print('\n  No IOPS/throughput mismatches detected.')
PYEOF

echo "=== EBS CLOUDWATCH METRICS ==="
VOLUME_IDS=$(aws ec2 describe-volumes \
  --region "$REGION" \
  --filters 'Name=status,Values=in-use' \
  --query 'Volumes[].VolumeId' --output text | tr '\t' '\n')

for VOL_ID in $VOLUME_IDS; do
  echo "--- EBS volume: $VOL_ID ---"
  for METRIC in VolumeReadOps VolumeWriteOps VolumeReadBytes VolumeWriteBytes \
                VolumeTotalReadTime VolumeTotalWriteTime VolumeIdleTime \
                VolumeQueueLength BurstBalance; do
    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/EBS --metric-name "$METRIC" \
      --dimensions Name=VolumeId,Value="$VOL_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Maximum Sum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum,Sum]' \
      --output text 2>/dev/null || echo 'N/A')
    echo "  $METRIC: $DATA"
  done
done

echo "=== OS-LEVEL DISK USAGE (CloudWatch agent) ==="
# Only available if CloudWatch agent is installed with disk metrics enabled.
# Checks /dev/xvda1, /dev/nvme0n1p1, and any other partitions reporting.
DISK_NAMESPACES='CWAgent CloudWatchAgent'
for NS in $DISK_NAMESPACES; do
  METRICS=$(aws cloudwatch list-metrics \
    --namespace "$NS" --metric-name 'disk_used_percent' \
    --region "$REGION" \
    --query 'Metrics[*].Dimensions' \
    --output json 2>/dev/null) || continue
  echo "$METRICS" | python3 -c "
import json,sys,subprocess,os
dims_list=json.load(sys.stdin)
if not dims_list:
    print('  No disk_used_percent metrics found in $NS namespace (agent not installed or not reporting)')
    sys.exit(0)
for dims in dims_list:
    dim_args=[]
    for d in dims: dim_args.extend([d['Name'],d['Value']])
    dim_str=' '.join(f'Name={k},Value={v}' for k,v in zip(dim_args[::2],dim_args[1::2]))
    label='/'.join(d['Value'] for d in dims if d['Name'] in ('path','device','host'))
    r=subprocess.run(['aws','cloudwatch','get-metric-statistics',
        '--namespace','$NS','--metric-name','disk_used_percent',
        '--dimensions'] + [f'Name={d[\"Name\"]},Value={d[\"Value\"]}' for d in dims] + [
        '--start-time','$START','--end-time','$END',
        '--period','300','--statistics','Average','Maximum',
        '--region','$REGION',
        '--query','sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum]',
        '--output','text'],
        capture_output=True,text=True,
        env={**os.environ,'AWS_PROFILE':'prod-diagnostics'})
    vals=r.stdout.strip()
    if vals and vals != 'None':
        parts=vals.split()
        avg=float(parts[0]) if parts else 0
        flag='*** CRITICAL' if avg>90 else ('WARNING' if avg>80 else '')
        print(f'  {label}: avg={avg:.1f}% {flag}')
" 2>/dev/null || true
done

echo "=== RDS STORAGE AUTOSCALING STATE ==="
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[].{
    ID:DBInstanceIdentifier,
    AllocatedGB:AllocatedStorage,
    MaxAutoscaleGB:MaxAllocatedStorage,
    StorageType:StorageType,
    ProvisionedIOPS:Iops}' \
  --output json | python3 -c "
import json,sys
instances=json.load(sys.stdin)
for inst in instances:
    max_auto=inst.get('MaxAutoscaleGB')
    alloc=inst.get('AllocatedGB',0)
    iid=inst.get('ID','')
    if max_auto:
        headroom=max_auto-alloc
        pct_used=alloc/max_auto*100
        flag='*** LOW HEADROOM' if headroom<50 else ''
        print(f'  {iid}: allocated={alloc}GB max_autoscale={max_auto}GB headroom={headroom}GB ({pct_used:.0f}% of max) {flag}')
    else:
        print(f'  {iid}: storage autoscaling DISABLED (MaxAllocatedStorage not set)')
" 2>/dev/null

} | ./bedrock-ask.sh "Analyse this disk and storage evidence for a production incident.

For RDS storage: identify instances where FreeStorageSpace is trending toward zero, especially
if the trend is consistent across the analysis window rather than spiky. A sustained decline in
FreeStorageSpace without a corresponding spike in write activity suggests table bloat or
index growth rather than a write surge. Flag instances with storage autoscaling disabled that
are approaching their allocated limit. gp2 baseline performance is 3 IOPS per GB with a floor
of 100 IOPS, so a gp2 volume below 334GB has a baseline under roughly 1000 IOPS; a production
database on a small gp2 volume will exhaust its burst bucket under sustained load and then
drop to baseline, producing latency spikes that look like query performance regressions.
BurstBalance approaching zero on a gp2 RDS volume is a strong signal that the volume is
undersized for its workload and is a candidate for migration to gp3 or io2. High
DiskQueueDepth combined with low BurstBalance indicates the burst bucket is the bottleneck.

For EBS volumes: VolumeQueueLength above 1 sustained over a 5-minute period indicates the
volume cannot service requests fast enough and I/O is queuing. This will surface in
applications as increased latency on any operation that hits disk. BurstBalance at zero on gp2
means the burst bucket is empty and the volume is running at its baseline 3 IOPS/GB rate.
For a 100GB gp2 volume this is 300 IOPS; most production workloads will saturate this
immediately. VolumeIdleTime near zero means the volume is busy for almost the entire period,
which, combined with high VolumeQueueLength, points to storage saturation.

For IOPS and throughput mismatches: the instance type defines a hard ceiling on aggregate EBS
IOPS and throughput regardless of what the volumes are provisioned for. Provisioning io2
volumes with 20000 IOPS on an instance type whose EBS maximum is 10000 IOPS means the excess
capacity is invisible at the instance level; under load the instance will cap at its own
ceiling and the provisioned IOPS on the volumes will never be reached. This is expensive and
produces confusing latency behaviour because the throttling happens silently at the hypervisor
layer. Identify any instance where the sum of provisioned IOPS across attached volumes exceeds
the instance type EBS IOPS maximum, and any instance where provisioned throughput exceeds the
instance type EBS throughput maximum.

For OS-level disk usage: disk_used_percent above 90% on any partition means the filesystem is
nearly full. On Linux this will cause write failures on the affected partition, which manifest
as application errors that look like permission denied or no space left on device. The /var
partition filling up will kill logging. The root partition filling up can prevent new processes
from starting. Flag any partition above 80% as a warning and above 90% as critical."
EOF
chmod +x ./diag-disk.sh
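
The prompt above leans on two pieces of arithmetic: the gp2 baseline of 3 IOPS per GB, and the gap between the IOPS provisioned on attached volumes and the ceiling of the instance type. The sketch below works through both so the numbers in the prompt can be checked by hand. It is an illustration, not part of diag-disk.sh. The function names are hypothetical, the 100 IOPS floor and 16,000 IOPS cap are the published gp2 limits, and the instance ceiling is assumed to be supplied by the caller rather than looked up.

```python
from typing import Iterable

# Published gp2 limits: 3 IOPS per GB, never below 100, never above 16,000.
GP2_IOPS_PER_GB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000


def gp2_baseline_iops(size_gb: int) -> int:
    """Baseline (non-burst) IOPS for a gp2 volume of the given size."""
    return max(GP2_MIN_IOPS, min(GP2_MAX_IOPS, GP2_IOPS_PER_GB * size_gb))


def unreachable_iops(volume_iops: Iterable[int], instance_max_iops: int) -> int:
    """Provisioned IOPS across attached volumes that the instance can never use.

    instance_max_iops is an assumption of this sketch: the caller supplies the
    EBS IOPS maximum for the instance type.
    """
    return max(0, sum(volume_iops) - instance_max_iops)


if __name__ == "__main__":
    for size in (100, 334):
        print(f"gp2 {size}GB baseline: {gp2_baseline_iops(size)} IOPS")
    wasted = unreachable_iops([20_000], 10_000)
    print(f"IOPS provisioned beyond the instance ceiling: {wasted}")
```

A 100GB volume comes out at 300 IOPS and a 334GB volume at 1002, matching the figures in the prompt. The io2 example from the prompt (20,000 provisioned IOPS on an instance capped at 10,000) leaves 10,000 IOPS that are paid for but unreachable.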

Published at andrewbaker.ninja. Source: https://andrewbaker.ninja/2026/06/13/the-hitchhikers-guide-to-production-outage-triage-with-amazon-bedrock/