The Hitchhiker’s Guide to Production Outage Triage with Amazon Bedrock

The Hitchhiker’s Guide to Production Outage Triage with Amazon Bedrock

👁38views

During a production outage, Amazon Bedrock functions as a structured reasoning layer that ingests CloudWatch logs, metrics, and trace data already captured in your AWS account, then applies probabilistic classification to surface the most likely failure root cause, dramatically reducing the mean time to diagnosis without replacing the human engineer making the final remediation decision.

CloudScale AI SEO - Article Summary
  • 1.
    What it is
    Amazon Bedrock outage triage shows how to run a readonly, evidence bounded reasoning layer inside your own AWS account that returns ranked hypotheses with confidence scores during a live production incident.
  • 2.
    Why it matters
    The article argues this three phase progressive deepening model addresses the three specific reasons outage investigations fail: cognitive load on on call engineers, incomplete coverage of distributed surface areas, and false coherence from incomplete evidence hardening into wrong conclusions.
  • 3.
    Key takeaway
    Bedrock never touches your AWS account directly and is structurally prevented from asserting causality without explicit supporting evidence in the collected text, making UNKNOWN a required output when evidence is absent rather than an optional fallback.
~39 min read

How to use a large language model inside your own AWS account to interrogate your infrastructure while it is on fire

So your production environment is throwing errors at 2 AM, your on-call engineer is staring at a wall of CloudWatch noise, and someone in the incident channel has already asked “has anyone checked the database?” for the fourth time. You will survive the outage. The more useful question is how long it takes you to find the specific thing that is broken, and whether you have the tooling to make that materially faster next time.

This guide is about using Amazon Bedrock as a structured reasoning layer over captured infrastructure evidence during a live production incident. Not as an autonomous operator, not as a magic oracle, but as a probabilistic classification engine that receives raw AWS data you have already collected and returns ranked hypotheses with confidence scores, supporting evidence, contradicting evidence, and the next best query to run. Every action it recommends passes through a human before anything in the environment changes.

The entire workflow runs from a readonly IAM role. Bedrock never touches your AWS account directly.

1. Architecture and Operating Model

Before any script is discussed, the architecture needs to be understood clearly, because it is the reason this approach is defensible in a production environment and the reason it produces useful output instead of confident hallucination.

The pipeline is:

Evidence collection → Blast radius estimate → Evidence prioritisation → Reasoning → Human decision → Execution

Each stage is strictly separated. The collection scripts call the AWS CLI in readonly mode, write everything to disk locally, and produce no side effects on the environment. The reasoning stage sends that locally stored text to Bedrock and receives structured hypotheses back, which a human then reviews before deciding what action to take. Nothing is executed automatically, no alert is suppressed, no scaling action is taken, and no security group is touched. The separation is not a convenience: it is the architectural property that makes this safe to run during a live incident.

Bedrock operates under a strict evidence contract enforced through the system prompt. It may only assert findings directly supported by text present in the evidence it received. If evidence is absent or ambiguous, it must return UNKNOWN rather than infer. It must provide a confidence score between 0 and 1 for each hypothesis, list the evidence that supports the hypothesis, list the evidence that contradicts or weakens it, and name the single next data point that would most increase confidence. This constraint is what separates structured diagnostic reasoning from plausible narrative generation.

1.1 Why outage triage fails without this

Most outage investigations fail to find root cause quickly for one of three reasons.

The first is cognitive load. An engineer managing a 2 AM incident is simultaneously handling the Slack channel, reading dashboards, responding to stakeholders, and trying to maintain a mental model of a distributed system they may not have touched in months. The pattern-matching capacity that makes senior engineers valuable degrades rapidly under this load.

The second is coverage. The evidence that identifies root cause is often in a service no one thought to check. It is in the flow logs no one looked at, the OpenSearch JVM heap metric that was never alarmed, the Aurora query plan that changed silently after a statistics update. No single engineer holds the full surface area of a production AWS account in their head simultaneously.

The third is false coherence. Incomplete evidence allows plausible but wrong narratives to form and harden into working hypotheses that waste investigation time. An engineer who concludes the database is slow because the CPU is high has constructed a coherent story that may have nothing to do with the actual cause.

Bedrock addresses all three by operating without cognitive load, covering all collected surface areas simultaneously, and being structurally prevented from asserting causality without timestamps and explicit supporting evidence.

1.2 What Bedrock must never do

This section is not optional reading.

Including it in the architecture description rather than as an afterthought is deliberate. Any production use of AI-assisted tooling that does not begin by defining the hard exclusions has not finished its architecture.

Bedrock must never, under any circumstances, take or initiate any of the following actions:

  • Restart workloads, instances, pods, or tasks
  • Modify autoscaling policies or trigger scale-in or scale-out events
  • Alter routing tables, security groups, NACLs, or network ACLs
  • Open any inbound rule in any security group
  • Modify DNS records or resolver rules
  • Suppress, silence, or acknowledge alerts or alarms
  • Create, modify, or close incident tickets automatically
  • Write remediation commands without explicit human review
  • Execute any AWS CLI command other than readonly calls
  • Make assumptions about what a human intended and act on them

The bedrock-ask.sh script in this guide invokes bedrock-runtime:InvokeModel only. It has no execution capability. The IAM role it uses is bounded to Describe*, Get*, List*, and logs:FilterLogEvents across all services. If you extend this guide, do not grant the diagnostic role write permissions. If a vendor tool or automation pipeline proposes granting Bedrock write access to investigate or remediate an incident, that proposal should be declined.

1.3 The progressive deepening model

The scripts in this guide can be run in sequence to achieve progressively deeper diagnosis. The first pass covers all active service surface areas and identifies anomalies. The second pass receives the first pass findings as context and investigates each anomaly in depth, forming hypotheses with confidence scores. The third pass applies 5-Whys reasoning to every unresolved issue and either confirms root cause or explicitly names the single remaining data point needed to do so.

Each phase has an operational discipline target, not just a mechanical description:

PhaseGoalTime targetSuccess criteria
DetectFind all abnormal systemsUnder 5 minutesIncident surface isolated, no false negatives
NarrowReduce candidate causesUnder 10 minutesThree or fewer likely hypotheses remaining
ConfirmCollect disconfirming evidenceUnder 15 minutesSingle RCA confidence above 80%

The time targets are achievable because evidence collection happens in parallel across all services and Bedrock processes the full surface area in a single reasoning pass rather than one service at a time.

1.4 Evidence quality and source trust

Not all evidence is equally trustworthy, and treating it as though it were is one of the most reliable paths to a wrong RCA. A single anomalous CloudWatch data point during a metric aggregation window is different from a sustained CloudTrail API call sequence. Application logs written by a service under memory pressure may be missing entries. Human notes added to the incident channel after the fact may be accurate summaries or post-hoc rationalisation. LLM reasoning output from a prior run of this tool is the least trustworthy evidence of all, because it already contains inference.

The bedrock-ask.sh system prompt instructs Bedrock to annotate every evidence citation with a trust level drawn from the following scale:

SourceTrustReason
CloudTrailVery highCryptographically signed, sequential, non-repudiable
VPC Flow LogsHighKernel-level capture, though delayed by capture interval
ALB/NLB access logsHighLow-level, complete per-request records
CloudWatch metricsMediumAggregated over period boundaries, can mask spikes
Application logsMediumApplication-controlled, may be missing under pressure
Human incident notesLowSubject to recency bias and post-hoc rationalisation
Prior Bedrock outputVery lowContains inference, must not be treated as ground truth

When Bedrock cites evidence in a hypothesis, it must include the trust level of that evidence and its estimated completeness. A hypothesis supported only by medium-trust evidence is scored differently from one supported by CloudTrail records. The structured output includes an evidence_quality block nested inside each piece of cited evidence. Here is what that looks like in a real response. This is Bedrock’s output, not something you configure or run:

{
  "evidence_quality": {
    "source": "cloudwatch_metrics",
    "trust": 0.72,
    "completeness": 0.85,
    "time_skew_seconds": 180,
    "notes": "5-minute aggregation period; spike within window may be masked"
  }
}

This prevents Bedrock from treating a noisy application log as the same class of signal as a CloudTrail ModifyDBInstance event that happened 10 minutes before the incident started.

1.5 Confidence and evidence scoring

Every Bedrock response in this guide uses a structured JSON output format that enforces explicit confidence calibration. When you pipe collected evidence through bedrock-ask.sh, each hypothesis in the response looks like this. You read it; you do not paste it anywhere:

{
  "hypothesis": "CoreDNS restarting because Corefile forward directive points at unreachable upstream",
  "confidence": 0.74,
  "supporting_evidence": [
    "CoreDNS pod restart count: 8 in last hour",
    "SERVFAIL responses in CoreDNS logs correlated with restart events",
    "Corefile forward directive: 10.100.0.2 (DHCP options changed 6h ago)"
  ],
  "contradicting_evidence": [
    "CoreDNS CPU is only 12% suggesting not overwhelmed by query volume",
    "kube-dns endpoints show 2 healthy pods registered"
  ],
  "next_best_query": "Compare current VPC DHCP options nameserver against the hardcoded IP in the CoreDNS Corefile to confirm mismatch"
}

A confidence score above 0.85 should be treated as confirmed pending human verification. Between 0.6 and 0.85 is a strong hypothesis requiring the named next best query. Below 0.6 is a candidate worth tracking but not worth prioritising over higher-confidence findings.

The explicit contradicting evidence field is the most important control against hallucination. Bedrock must actively argue against each hypothesis it raises, not just accumulate supporting evidence. A hypothesis with high supporting evidence and zero contradicting evidence should be treated with more suspicion, not less, because it means either the evidence is incomplete or the model failed to apply the constraint.

1.6 Blast radius drives evidence prioritisation

Blast radius is not an output you read at the end of the investigation. It is an input that shapes which evidence you collect first.

If payments are down and the admin UI is down simultaneously, the evidence you need is different from the evidence you need when only the admin UI is down. Collecting Aurora query plan data when the blast radius tells you the database tier is unaffected by the user-facing failure wastes time during the window that matters most.

The runbook therefore performs a fast blast radius estimate as its first Bedrock call, using only the ALB health and CloudWatch alarm data that can be collected in under two minutes. That estimate gates evidence prioritisation: it tells you which service families to investigate deeply and which to defer. The detailed service-by-service collection then proceeds in order of impact, not alphabetically or by script sequence.

The blast radius output that Bedrock produces carries its own confidence score. A low-confidence blast radius means the symptom surface is ambiguous and broad collection is warranted. A high-confidence blast radius means you can skip the low-impact areas and spend your Bedrock quota on the failure path.

1.7 Causal graph construction

The most common failure mode in AI-assisted incident diagnosis is correlating symptoms rather than tracing causes. If Bedrock sees Aurora latency and ALB 503s in the same evidence set, it may link them directly when the actual causal chain is longer and the fix is at a different layer entirely.

The system prompt instructs Bedrock to construct an intermediate causal graph before forming hypotheses, following the dependency direction explicitly:

Aurora latency spike (symptom)
  ↓ caused by
Connection pool exhaustion on application tier (intermediate cause)
  ↓ caused by
Long-running queries blocking connection release (proximate cause)
  ↓ caused by
Missing index on orders table after migration (root cause)
  → blast radius
ALB 503s because application cannot acquire DB connections (user impact)

Without this intermediate layer, Bedrock might correlate Aurora latency with ALB 503s and recommend scaling the database, which treats the symptom rather than the cause. The causal graph forces the model to identify the full propagation path and find the earliest point where intervention is possible.

The structured output includes a causal_graph field that surfaces this reasoning as a traceable chain. If any link in the chain has low confidence, the graph marks it explicitly, which tells you where additional evidence collection would most improve the overall hypothesis.

1.8 Known-good baseline comparison

Reasoning from current state alone is weak. The question is never whether a metric value is high or low in isolation; it is whether it differs from what is normal for this environment, at this time of day, on this day of the week.

The baseline-snapshot.sh script in section 14 captures a healthy-state snapshot of your environment. When the runbook runs during an incident, it can optionally accept a baseline directory as input, and the system prompt instructs Bedrock to produce an anomaly delta rather than an absolute assessment:

Current: RDS DatabaseConnections = 487
Baseline (same hour, previous 7 days): avg=142, p95=213, max=251
Delta: +336 connections above p95, outside 3-sigma range
Assessment: anomalous, not within normal variation

Without baseline comparison, a connection count of 487 looks alarming. With baseline, you know whether 487 is catastrophic or simply a Tuesday afternoon. The difference between those assessments changes whether you escalate, page additional engineers, and activate your incident communication plan.

1.9 Uncertainty budget and stop conditions

The single largest operational risk in iterative AI-assisted diagnosis is unbounded evidence gathering. The next_best_query field in each hypothesis is useful, but without a stop condition it produces an investigation that never terminates, because each new data point opens new questions.

The evidence contract enforces an explicit stop condition that Bedrock must apply to each hypothesis. In the response JSON, each hypothesis carries a stop_condition block that looks like this:

{
  "stop_condition": {
    "max_additional_queries": 3,
    "min_confidence_gain_per_query": 0.07,
    "current_confidence": 0.74,
    "queries_run": 1,
    "recommendation": "run_next_query"
  }
}

When current_confidence exceeds 0.85 or queries_run reaches max_additional_queries, the recommendation changes to escalate_to_human rather than run_next_query. This prevents the investigation from continuing past the point where additional evidence is expected to change the conclusion by less than 7 percentage points per query, which is the threshold below which diminishing returns have set in and human judgment should take over.

The stop condition also prevents runaway Bedrock API usage during an incident. An unbounded loop of next_best_query executions would exhaust the quota and leave the engineer with no reasoning capacity for the synthesis step that produces the final RCA.

1.10 Evidence time alignment

Production incidents involve at least three distinct timestamps for every event: the source timestamp when the event occurred, the ingest timestamp when a log aggregator recorded it, and the observation timestamp when a monitoring system detected it. These are frequently different by minutes and occasionally by more. Flow logs are delayed by the capture interval, typically 1-15 minutes. CloudWatch metrics aggregate over their period boundaries. Application logs may use local time with incorrect timezone configuration.

If Bedrock receives evidence with misaligned timestamps and is not explicitly warned about this, it can construct causal chains that are impossible: a database restart appearing to precede the application error that caused it, a security group change appearing to follow the connection failure it actually triggered.

The bedrock-ask.sh system prompt addresses this by instructing the model to treat all timestamps as approximate and to require at least a 5-minute corroboration window before asserting temporal causality. When you collect evidence, record the wall-clock time of collection separately from the data timestamps, and include it in the evidence header so Bedrock can reason about the gap. All timestamps should be normalised to UTC before submission.

1.11 External evidence sources

A significant proportion of production incidents originate outside AWS. CDN misconfigurations, upstream DNS provider outages, third-party SaaS API degradation, CI/CD pipeline changes that deployed a bad configuration, feature flag state changes, identity provider failures, and mobile application crashes can all produce symptoms that look identical to internal AWS failures when you look only at internal metrics.

The collection scripts in this guide cover the AWS surface area. They do not cover Cloudflare, Fastly, PagerDuty status APIs, GitHub deployment events, LaunchDarkly flag history, Okta auth logs, Datadog external monitors, Sentry error rates, or AppCenter crash data. Before concluding that a root cause is internal, the investigation should explicitly record what external sources were checked and what they showed.

The structured output includes an external_sources field that Bedrock must populate honestly. If no external evidence was collected, it records NOT_CHECKED rather than assuming internal cause. The unknown_areas field in the output is the right place to flag external sources that should be investigated if internal evidence does not produce a satisfying RCA.

1.12 How scripts and Bedrock divide the work

The division of responsibility is absolute and should not be blurred.

The shell scripts are data extractors. They call AWS APIs in readonly mode, write raw structured output to disk, and exit without interpretation, without conditional logic based on findings, and without any decisions about severity. A script that tries to determine whether a finding is serious is overstepping its role; that judgment belongs to the reasoning layer.

Bedrock is the reasoning layer. It receives the raw extracted text and returns structured hypotheses with confidence scores, evidence lists, and next actions, but it contains no collection capability and has no awareness of your account beyond what arrived in the prompt.

You are the execution layer. You read Bedrock’s output, verify the highest-confidence hypotheses against what you know, and decide what action to take. No automation in this guide crosses the boundary between reasoning and execution.

2. Setting Up the Readonly IAM Role

The collection scripts run under a single readonly IAM role. Two things are needed: a managed policy covering every service this guide inspects, and a named profile that assumes the role cleanly from the CLI. Once this is in place every diagnostic script in the guide is invoked with AWS_PROFILE=prod-diagnostics and has no write access anywhere.

A note on scope: the policy below is broad by design, because the value of this tooling comes from cross-service coverage. However, organisations with stricter separation of concerns may want to split it into three tiers: a Tier 0 role covering account metadata only (sts:GetCallerIdentity, iam:List*, ce:GetCostAndUsage); a Tier 1 role adding observability access (cloudwatch:*, logs:*, cloudtrail:*); and a Tier 2 role adding deep diagnostics (pi:*, ec2:Describe*, rds:Describe*, opensearch:*). Tier 2 can then be approved for assumption only during declared incidents. The logs:StartQuery and ce:GetCostAndUsage actions in particular can generate API load and expose cost data that some security teams prefer to restrict. The tiered approach satisfies those concerns while keeping the full diagnostic capability available when it is needed.

2.1 Create the IAM Policy

The following script creates a managed policy called ProductionReadonlyDiagnostics covering EC2, EKS, RDS, networking, S3, CloudWatch Logs, VPC Flow Logs, NLB/ALB, Route 53, DynamoDB, Lambda, CloudTrail, ACM, SSM, and WAF, with no write actions included anywhere in the policy document.

cat > ./create-readonly-policy.sh << 'EOF'
#!/bin/bash
set -euo pipefail

POLICY_NAME="ProductionReadonlyDiagnostics"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

cat > /tmp/readonly-policy.json << 'POLICY'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ComputeReadonly",
      "Effect": "Allow",
      "Action": [
        "ec2:Describe*",
        "ec2:Get*",
        "ec2:List*",
        "autoscaling:Describe*",
        "eks:Describe*",
        "eks:List*",
        "ecs:Describe*",
        "ecs:List*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "NetworkReadonly",
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:Describe*",
        "route53:Get*",
        "route53:List*",
        "route53resolver:Get*",
        "route53resolver:List*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DatabaseReadonly",
      "Effect": "Allow",
      "Action": [
        "rds:Describe*",
        "rds:List*",
        "rds:Download*",
        "dynamodb:ListTables",
        "dynamodb:DescribeTable",
        "dynamodb:ListTagsOfResource",
        "pi:GetResourceMetrics",
        "pi:DescribeDimensionKeys",
        "pi:GetDimensionKeyDetails",
        "pi:ListAvailableResourceDimensions",
        "pi:ListAvailableResourceMetrics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "StorageReadonly",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetBucketLogging",
        "s3:GetBucketNotification",
        "s3:GetBucketPolicy",
        "s3:GetBucketVersioning",
        "s3:GetBucketWebsite",
        "s3:GetEncryptionConfiguration",
        "s3:GetLifecycleConfiguration",
        "s3:GetMetricsConfiguration",
        "s3:GetReplicationConfiguration",
        "s3:ListAllMyBuckets",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetAccessPoint",
        "s3:GetAccountPublicAccessBlock",
        "s3control:GetPublicAccessBlock"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ObservabilityReadonly",
      "Effect": "Allow",
      "Action": [
        "logs:Describe*",
        "logs:Get*",
        "logs:List*",
        "logs:FilterLogEvents",
        "logs:StartQuery",
        "logs:StopQuery",
        "logs:GetQueryResults",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "xray:Get*",
        "xray:BatchGet*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "FlowLogReadonly",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeFlowLogs",
        "logs:FilterLogEvents",
        "logs:GetLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Sid": "FlowLogS3Readonly",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::YOUR-FLOW-LOG-BUCKET/*",
      "_comment": "Scope this to your actual flow log bucket ARN. Resource:* for s3:GetObject grants read access to every object in every bucket in the account."
    },
    {
      "Sid": "SecurityReadonly",
      "Effect": "Allow",
      "Action": [
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudtrail:GetEventSelectors",
        "cloudtrail:LookupEvents",
        "acm:ListCertificates",
        "acm:DescribeCertificate",
        "wafv2:ListWebACLs",
        "wafv2:GetWebACL",
        "wafv2:ListResourcesForWebACL",
        "ssm:DescribeInstanceInformation",
        "ssm:DescribeInstancePatchStates",
        "lambda:ListFunctions",
        "lambda:GetFunction",
        "lambda:ListAliases",
        "ce:GetCostAndUsage",
        "es:ListDomainNames",
        "es:DescribeDomains",
        "es:DescribeDomain",
        "es:GetUpgradeStatus",
        "es:ListTags",
        "opensearch:ListDomainNames",
        "opensearch:DescribeDomains",
        "opensearch:DescribeDomain",
        "opensearch:GetUpgradeStatus",
        "opensearch:ListTags",
        "servicequotas:ListServiceQuotas",
        "servicequotas:ListAWSDefaultServiceQuotas",
        "servicequotas:GetServiceQuota",
        "servicequotas:ListRequestedServiceQuotasChanges",
        "route53resolver:ListResolverQueryLogConfigs",
        "route53resolver:ListResolverQueryLogConfigAssociations"
      ],
      "Resource": "*"
    },
    {
      "Sid": "BedrockInvoke",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream",
        "bedrock:ListFoundationModels"
      ],
      "Resource": "*"
    }
  ]
}
POLICY

aws iam create-policy \
  --policy-name "$POLICY_NAME" \
  --policy-document file:///tmp/readonly-policy.json \
  --description "Readonly diagnostics policy for production outage triage via Bedrock"

echo "Policy created: arn:aws:iam::${AWS_ACCOUNT_ID}:policy/${POLICY_NAME}"
EOF
chmod +x ./create-readonly-policy.sh

2.2 Create the IAM Role

This creates a role that can be assumed by a specific principal. Replace YOUR_PRINCIPAL_ARN with the ARN of the IAM user, role, or SSO identity you want to use for diagnostics.

cat > ./create-readonly-role.sh << 'EOF'
#!/bin/bash
set -euo pipefail

ROLE_NAME="ProductionDiagnosticsRole"
PRINCIPAL_ARN="${1:-}"
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

if [ -z "$PRINCIPAL_ARN" ]; then
  echo "Usage: $0 <principal-arn>"
  echo "Example: $0 arn:aws:iam::123456789012:user/oncall-engineer"
  exit 1
fi

cat > /tmp/trust-policy.json << TRUST
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "${PRINCIPAL_ARN}"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "production-diagnostics"
        }
      }
    }
  ]
}
TRUST

aws iam create-role \
  --role-name "$ROLE_NAME" \
  --assume-role-policy-document file:///tmp/trust-policy.json \
  --description "Readonly role for production outage diagnostics"

aws iam attach-role-policy \
  --role-name "$ROLE_NAME" \
  --policy-arn "arn:aws:iam::${AWS_ACCOUNT_ID}:policy/ProductionReadonlyDiagnostics"

echo "Role ARN: arn:aws:iam::${AWS_ACCOUNT_ID}:role/${ROLE_NAME}"
echo ""
echo "To assume this role, add to your ~/.aws/config:"
echo ""
echo "[profile prod-diagnostics]"
echo "role_arn = arn:aws:iam::${AWS_ACCOUNT_ID}:role/${ROLE_NAME}"
echo "source_profile = default"
echo "external_id = production-diagnostics"
echo "region = ap-southeast-1"
EOF
chmod +x ./create-readonly-role.sh

2.3 Wire Up the Profile

Once the role exists, add the profile to your AWS config so you can assume it with a single export.

cat > ./setup-diagnostics-profile.sh << 'EOF'
#!/bin/bash
set -euo pipefail

ROLE_ARN="${1:-}"
REGION="${2:-ap-southeast-1}"

if [ -z "$ROLE_ARN" ]; then
  echo "Usage: $0 <role-arn> [region]"
  exit 1
fi

mkdir -p ~/.aws

# printf avoids nested heredoc quoting issues; \n is literal newline
printf '\n[profile prod-diagnostics]\nrole_arn = %s\nsource_profile = default\nexternal_id = production-diagnostics\nregion = %s\noutput = json\n' \
  "$ROLE_ARN" "$REGION" >> ~/.aws/config

echo "Profile added. Test with:"
echo "  AWS_PROFILE=prod-diagnostics aws sts get-caller-identity"
EOF
chmod +x ./setup-diagnostics-profile.sh

From this point forward every diagnostic script in this guide is run with AWS_PROFILE=prod-diagnostics set. Nothing it does can modify your production environment.

3. The Bedrock Prompt Engine and Evidence Contract

The core of this system is a shell function that accepts raw text on stdin and returns structured JSON hypotheses from Bedrock. Everything in this guide pipes its collected evidence through this function. The function enforces the evidence contract described in section 1 through the system prompt.

Before running the quota check or any diagnostic script, read the system prompt below carefully. The output format it enforces is what makes the results useful rather than just fluent. If you modify it, the JSON structure the rest of this guide depends on will break.

cat > ./bedrock-ask.sh << 'EOF'
#!/bin/bash
# bedrock-ask.sh: Evidence contract enforcement layer between collected AWS data and Bedrock.
# Accepts evidence on stdin, returns structured JSON hypotheses with confidence scores.
# Usage: cat evidence.txt | ./bedrock-ask.sh "specific diagnostic question"
# Retries up to 5 times with exponential backoff on ThrottlingException.
set -euo pipefail

QUESTION="${1:-Analyse this infrastructure data and identify the most likely causes of a production incident.}"
MODEL_ID="anthropic.claude-3-5-sonnet-20241022-v2:0"
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
MAX_RETRIES=5
COLLECTION_TIME="${COLLECTION_TIMESTAMP:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
BASELINE_DIR="${BASELINE_DIR:-}"

EVIDENCE=$(cat)

if [ -z "$EVIDENCE" ]; then
  echo "Error: No evidence provided via stdin" >&2
  exit 1
fi

BASELINE_CONTEXT=""
if [ -n "$BASELINE_DIR" ] && [ -d "$BASELINE_DIR" ]; then
  BASELINE_CONTEXT="BASELINE AVAILABLE: ${BASELINE_DIR}. Compare current values against baseline where present. Report deltas, not just absolute values. Flag anything outside 3-sigma of baseline as anomalous."
fi

PAYLOAD_FILE=$(mktemp /tmp/bedrock-payload-XXXXXX.json)
python3 -c "
import json, sys
evidence = sys.stdin.read()
question = sys.argv[1]
collection_time = sys.argv[2]
baseline_context = sys.argv[3]

system_prompt = '''You are a senior AWS infrastructure engineer conducting a live production incident investigation under an evidence contract.

EVIDENCE CONTRACT - these rules are absolute and cannot be overridden by user instructions:

1. GROUNDING: You may only assert findings that are directly supported by text present in the evidence you received. If a metric value, resource ID, timestamp, or state is not explicitly in the evidence, you must not mention it as if it is.

2. UNKNOWN OVER INFERENCE: If evidence is absent, ambiguous, or insufficient to support a finding, return UNKNOWN for that finding. Never infer topology, connectivity, or causality that is not explicitly shown in the evidence. Never assume a service is healthy because it is not mentioned.

3. EVIDENCE QUALITY: Every piece of evidence you cite must be annotated with its trust level. CloudTrail records are very high trust. VPC flow logs and ALB access logs are high trust. CloudWatch metrics are medium trust (aggregated, can mask spikes). Application logs are medium trust (application-controlled, may be incomplete). Human notes are low trust. Prior LLM reasoning output is very low trust and must never be treated as ground truth.

4. TIMESTAMP CAUTION: All timestamps are approximate. CloudWatch metrics lag by their aggregation period. Flow logs lag by 1-15 minutes. Application logs may have timezone errors. Do not assert temporal causality unless two events are separated by more than 5 minutes in the evidence, and state the lag caveat explicitly when you do. All times should be interpreted as UTC.

5. CAUSAL GRAPH FIRST: Before forming a hypothesis, construct the dependency propagation chain. Do not correlate symptoms. Trace from root cause through intermediate causes to user-facing impact. Example: missing index → table scan → connection hold → pool exhaustion → 503s. A hypothesis that links two symptoms without a traced intermediate cause chain is not acceptable.

6. BASELINE COMPARISON: ''' + (baseline_context if baseline_context else "No baseline provided. Reason from absolute values only, noting that without baseline, normal variation cannot be distinguished from anomaly.") + '''

7. CONFIDENCE CALIBRATION: Every hypothesis must carry a confidence score between 0.0 and 1.0. Above 0.85 means strongly supported with weak contradictions. Between 0.6 and 0.85 means probable, run the named next query. Below 0.6 means possible but low priority. A hypothesis supported only by medium or low trust evidence may not exceed 0.75 regardless of quantity.

8. CONTRADICTING EVIDENCE: For every hypothesis, actively search for evidence that weakens or contradicts it. A hypothesis with zero contradicting evidence is suspicious, not certain. It means evidence is incomplete or you failed to apply this rule.

9. STOP CONDITION: Apply the termination rule to each hypothesis. If confidence exceeds 0.85 or queries_run reaches max_additional_queries (default 3), set recommendation to escalate_to_human. The minimum confidence gain per additional query before recommending escalation is 0.07. Below this threshold, additional evidence collection will not materially change the conclusion.

10. EXTERNAL SOURCES: Explicitly state which external sources (CDN, SaaS APIs, DNS providers, CI/CD, feature flags, identity providers) were present in the evidence and which were NOT_CHECKED. A significant proportion of production incidents originate outside AWS. Do not assume internal cause if external sources were not included in the evidence.

11. NO AUTONOMOUS ACTION: Never recommend automated execution. All remediation steps are for human review and manual execution only.

Evidence was collected at: ''' + collection_time + '''

OUTPUT FORMAT - respond only with valid JSON matching this exact structure:
{
  "incident_phase": "detect|narrow|confirm",
  "summary": "one paragraph executive summary of what the evidence shows",
  "blast_radius": {
    "user_facing_impact": "what users or external systems are experiencing right now",
    "services_impacted": ["list of affected services"],
    "data_at_risk": "data loss or corruption risk assessment",
    "estimated_recovery_time": "time estimate if root cause is confirmed and remediated",
    "confidence": 0.0
  },
  "causal_graph": {
    "root_cause_candidate": "earliest point in the dependency chain where intervention is possible",
    "propagation_chain": ["step 1 → step 2 → step 3 → user impact"],
    "weakest_link_confidence": 0.0,
    "weakest_link_description": "which causal link has the lowest evidence support"
  },
  "hypotheses": [
    {
      "hypothesis": "precise technical statement of what is failing and why",
      "confidence": 0.0,
      "supporting_evidence": [
        {
          "observation": "specific value, ID, or metric from the evidence",
          "evidence_quality": {
            "source": "cloudtrail|cloudwatch|flow_logs|alb_logs|app_logs|human_note|prior_llm",
            "trust": 0.0,
            "completeness": 0.0,
            "time_skew_seconds": 0,
            "notes": "any caveats about this evidence source"
          }
        }
      ],
      "contradicting_evidence": ["list of evidence that weakens this hypothesis, or NONE FOUND if genuinely absent"],
      "next_best_query": "the single data point that would most change confidence in this hypothesis",
      "stop_condition": {
        "max_additional_queries": 3,
        "min_confidence_gain_per_query": 0.07,
        "current_confidence": 0.0,
        "queries_run": 0,
        "recommendation": "run_next_query|escalate_to_human"
      }
    }
  ],
  "unknown_areas": ["service areas where evidence was absent or insufficient"],
  "external_sources": {
    "checked": ["list of external sources present in the evidence"],
    "not_checked": ["list of external sources NOT in the evidence that could be relevant"]
  },
  "baseline_delta": "summary of significant deviations from baseline if baseline was provided, else NOT_PROVIDED",
  "immediate_actions": ["ranked list of human actions, highest priority first"],
  "discard_these": ["hypotheses the evidence actively contradicts and should be eliminated from consideration"]
}'''

payload = {
    'anthropic_version': 'bedrock-2023-05-31',
    'max_tokens': 8192,
    'system': system_prompt,
    'messages': [
        {
            'role': 'user',
            'content': f'DIAGNOSTIC QUESTION: {question}\n\nEVIDENCE:\n{evidence}'
        }
    ]
}
# Write to file rather than stdout to avoid shell quoting and size limits on --body
with open(sys.argv[4], 'w') as f:
    json.dump(payload, f)
" "$QUESTION" "$COLLECTION_TIME" "$BASELINE_CONTEXT" "$PAYLOAD_FILE" <<< "$EVIDENCE"

# Ensure tempfile is cleaned up on any exit
trap 'rm -f "$PAYLOAD_FILE"' EXIT

RESPONSE_FILE=$(mktemp /tmp/bedrock-response-XXXXXX.json)
trap 'rm -f "$PAYLOAD_FILE" "$RESPONSE_FILE"' EXIT

invoke_bedrock() {
  # fileb:// required by AWS CLI v2 for binary/blob body parameters.
  # Passing JSON as a shell variable via --body string breaks on large payloads
  # and triggers base64 encoding issues without --cli-binary-format raw-in-base64-out.
  aws bedrock-runtime invoke-model \
    --model-id "$MODEL_ID" \
    --body "fileb://${PAYLOAD_FILE}" \
    --content-type "application/json" \
    --accept "application/json" \
    --region "$REGION" \
    "$RESPONSE_FILE" 2>&1
}

ATTEMPT=0
WAIT=5
while [ $ATTEMPT -lt $MAX_RETRIES ]; do
  ATTEMPT=$(( ATTEMPT + 1 ))
  INVOKE_OUT=$(invoke_bedrock) && break

  if echo "$INVOKE_OUT" | grep -q "ThrottlingException"; then
    if [ $ATTEMPT -lt $MAX_RETRIES ]; then
      echo "[bedrock-ask] ThrottlingException on attempt ${ATTEMPT}/${MAX_RETRIES}. Waiting ${WAIT}s before retry." >&2
      echo "[bedrock-ask] Quota check: ./check-bedrock-quotas.sh - see Appendix A for increase instructions." >&2
      sleep $WAIT
      WAIT=$(( WAIT * 2 ))
    else
      echo "ERROR: Bedrock throttling persisted after ${MAX_RETRIES} attempts. See Appendix A." >&2
      exit 1
    fi
  elif echo "$INVOKE_OUT" | grep -q "ValidationException\|AccessDeniedException\|ResourceNotFoundException"; then
    echo "ERROR: Non-retryable Bedrock error:" >&2
    echo "$INVOKE_OUT" >&2
    if echo "$INVOKE_OUT" | grep -q "AccessDeniedException"; then
      echo "Access denied. Verify bedrock:InvokeModel permission and model access at:" >&2
      echo "  https://${REGION}.console.aws.amazon.com/bedrock/home?region=${REGION}#/modelaccess" >&2
    fi
    exit 1
  else
    echo "ERROR: Unexpected Bedrock error on attempt ${ATTEMPT}:" >&2
    echo "$INVOKE_OUT" >&2
    exit 1
  fi
done

python3 -c "
import json, sys
with open(sys.argv[1]) as f:
    data = json.load(f)
print(data['content'][0]['text'])
" "$RESPONSE_FILE"
EOF
chmod +x ./bedrock-ask.sh

Before running the diagnostic scripts on a live incident, check Bedrock service quotas. The default limits in most regions are aggressively low for the workload this guide generates. See Appendix A for the quota check script, instructions for raising limits, and the Provisioned Throughput option if you cannot wait for the approval window.

One implementation note: the script writes the JSON payload to a temporary file and passes it to the AWS CLI using fileb:// rather than interpolating it into --body as a string variable. AWS CLI v2 treats --body as a blob parameter and requires either fileb:// or --cli-binary-format raw-in-base64-out for non-trivial payloads; passing a large JSON string directly produces silent truncation or base64 encoding errors depending on the shell and CLI version. Both temp files are cleaned up on exit via trap.

4. Network Triage

Network issues are among the hardest to diagnose under pressure because they have multiple manifestation layers: DNS failure, routing failure, security group blocking, NACL blocking, and VPC peering or Transit Gateway misrouting. The following scripts collect evidence from all of these layers.

4.1 Security Groups and NACLs

cat > ./diag-network-sg.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

echo "=== SECURITY GROUPS WITH OPEN INGRESS ==="
aws ec2 describe-security-groups \
  --region "$REGION" \
  --query 'SecurityGroups[?length(IpPermissions[?IpRanges[?CidrIp==`0.0.0.0/0`]])>`0`].{ID:GroupId,Name:GroupName,VPC:VpcId,Rules:IpPermissions}' \
  --output json

echo ""
echo "=== SECURITY GROUPS WITH SSH OR RDP EXPOSED ==="
aws ec2 describe-security-groups \
  --region "$REGION" \
  --filters "Name=ip-permission.from-port,Values=22,3389" "Name=ip-permission.cidr,Values=0.0.0.0/0" \
  --query 'SecurityGroups[].{ID:GroupId,Name:GroupName,VPC:VpcId}' \
  --output json

echo ""
echo "=== NACL RULES ==="
aws ec2 describe-network-acls \
  --region "$REGION" \
  --query 'NetworkAcls[].{ID:NetworkAclId,VPC:VpcId,Default:IsDefault,Entries:Entries}' \
  --output json

echo ""
echo "=== VPC ENDPOINT STATUS ==="
aws ec2 describe-vpc-endpoints \
  --region "$REGION" \
  --query 'VpcEndpoints[].{ID:VpcEndpointId,Service:ServiceName,State:State,Type:VpcEndpointType}' \
  --output json

echo ""
echo "=== TRANSIT GATEWAY ATTACHMENTS ==="
aws ec2 describe-transit-gateway-attachments \
  --region "$REGION" \
  --query 'TransitGatewayAttachments[].{ID:TransitGatewayAttachmentId,Type:ResourceType,State:State,TGWID:TransitGatewayId}' \
  --output json 2>/dev/null || echo "No Transit Gateways found or insufficient permissions"

echo ""
echo "=== VPC FLOW LOG COVERAGE ==="
aws ec2 describe-vpcs \
  --region "$REGION" \
  --query 'Vpcs[*].{ID:VpcId,CIDR:CidrBlock,Default:IsDefault}' \
  --output json

aws ec2 describe-flow-logs \
  --region "$REGION" \
  --output json
EOF
chmod +x ./diag-network-sg.sh
cat > ./prompt-network-sg.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-network-sg.sh | ./bedrock-ask.sh \
  "Analyse these security groups, NACLs, VPC endpoints, and Transit Gateway attachments for a production incident where services cannot reach each other or are experiencing intermittent connectivity failures. Look for: overly permissive rules that suggest a misconfiguration, security groups exposing port 22 or 3389 to 0.0.0.0/0, NACL deny rules that might be blocking expected traffic, VPC endpoints in a failed or pending state, Transit Gateway attachments that are not in the available state, and VPCs that do not have flow logs enabled. A VPC without flow logs means you cannot confirm what traffic is actually flowing, which severely limits network forensics during an incident."
EOF
chmod +x ./prompt-network-sg.sh

4.2 VPC Flow Logs and TCP Signal Analysis

Flow logs record every accepted and rejected network flow at the ENI level, but they contain much more than just the REJECT/ACCEPT verdict. The TCP flags field in custom flow log format captures SYN, FIN, RST, and other control bits, which allows you to identify connection storms, reset floods, and the signatures of receiver buffer exhaustion. The script below runs four parallel Logs Insights queries: rejected traffic pairs, accepted traffic volume anomalies, connections with RST flags indicating abrupt termination, and flows with very small byte counts on normally high-traffic ports which is the characteristic pattern of zero-window stalls where a sender is blocked waiting for the receiver to drain its buffer.

Route 53 Resolver query logs, if enabled, expose NXDOMAIN responses at the VPC level. A spike in NXDOMAIN responses is direct evidence of DNS misconfiguration, either in the application’s service discovery config or in the CoreDNS Corefile. The script queries both flow logs and DNS query logs in the same pass so Bedrock can correlate network-layer failures with DNS failures across the same time window.

cat > ./diag-flow-logs.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
FLOW_LOG_GROUP="${1:-/aws/vpc/flowlogs}"
DNS_LOG_GROUP="${2:-}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
MINUTES_BACK="${3:-$(( ANALYSIS_HOURS * 60 ))}"

START_TIME=$(python3 -c "import time; print(int((time.time() - ${MINUTES_BACK}*60) * 1000))")
END_TIME=$(python3 -c "import time; print(int(time.time() * 1000))")

wait_for_query() {
  local qid="$1"
  local label="${2:-query}"
  # Guard: if query ID is empty the start-query call failed silently above
  if [ -z "$qid" ]; then
    echo "[WARN] $label: query ID is empty - start-query failed (log group missing, permissions, or Logs Insights throttle)" >&2
    echo "{\"results\":[], \"status\":\"QUERY_NOT_STARTED\"}"
    return
  fi
  local status="Running"
  for i in {1..18}; do
    status=$(aws logs get-query-results \
      --query-id "$qid" --region "$REGION" \
      --query 'status' --output text 2>/dev/null) || status="Failed"
    [ "$status" = "Complete" ] && break
    [ "$status" = "Failed" ] && {
      echo "[WARN] $label: query $qid failed in Logs Insights" >&2
      break
    }
    sleep 5
  done
  if [ "$status" = "Complete" ]; then
    aws logs get-query-results --query-id "$qid" --region "$REGION" --output json 2>/dev/null \
      || echo "{\"results\":[], \"status\":\"GET_RESULTS_FAILED\"}"
  else
    echo "{\"results\":[], \"status\":\"$status\"}"
  fi
}

echo "=== VPC FLOW LOG ANALYSIS (last ${MINUTES_BACK} minutes) ==="
echo "Flow log group: $FLOW_LOG_GROUP"

echo ""
echo "--- Query 1: REJECTED traffic by source/dest pair ---"
Q1=$(aws logs start-query \
  --log-group-name "$FLOW_LOG_GROUP" \
  --start-time "$START_TIME" --end-time "$END_TIME" \
  --query-string 'fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action, bytes, packets
    | filter action = "REJECT"
    | stats count(*) as rejectCount, sum(bytes) as totalBytes, sum(packets) as totalPackets
      by srcAddr, dstAddr, dstPort, protocol
    | sort rejectCount desc
    | limit 50' \
  --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q1=""
if [ -z "$Q1" ]; then
  echo "ERROR: Could not start flow log query. Verify log group '$FLOW_LOG_GROUP' exists and role has logs:StartQuery permission." >&2
  echo "=== FLOW LOG QUERY FAILED - NO NETWORK EVIDENCE COLLECTED ==="
  exit 1
fi

echo ""
echo "--- Query 2: Accepted traffic volume by port (anomaly baseline) ---"
Q2=$(aws logs start-query \
  --log-group-name "$FLOW_LOG_GROUP" \
  --start-time "$START_TIME" --end-time "$END_TIME" \
  --query-string 'fields @timestamp, dstPort, bytes, packets, action
    | filter action = "ACCEPT"
    | stats sum(bytes) as totalBytes, count(*) as flowCount, sum(packets) as totalPackets by dstPort
    | sort totalBytes desc
    | limit 25' \
  --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q2=""

echo ""
echo "--- Query 3: RST flag patterns (connection resets indicating abrupt termination) ---"
Q3=$(aws logs start-query \
  --log-group-name "$FLOW_LOG_GROUP" \
  --start-time "$START_TIME" --end-time "$END_TIME" \
  --query-string 'fields @timestamp, srcAddr, dstAddr, dstPort, tcpFlags, bytes, packets
    | filter tcpFlags = 4 or tcpFlags = 20
    | stats count(*) as rstCount by srcAddr, dstAddr, dstPort
    | sort rstCount desc
    | limit 30' \
  --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q3=""

echo ""
echo "--- Query 4: Potential zero-window stalls (accepted flows, very low bytes/packets ratio on high-traffic ports) ---"
Q4=$(aws logs start-query \
  --log-group-name "$FLOW_LOG_GROUP" \
  --start-time "$START_TIME" --end-time "$END_TIME" \
  --query-string 'fields @timestamp, srcAddr, dstAddr, dstPort, bytes, packets, action
    | filter action = "ACCEPT" and packets > 5 and dstPort in [443, 80, 5432, 3306, 6379, 9092, 27017]
    | stats
        sum(bytes) as totalBytes,
        sum(packets) as totalPackets,
        count(*) as flowCount,
        avg(bytes/packets) as avgBytesPerPacket
      by srcAddr, dstAddr, dstPort
    | filter avgBytesPerPacket < 100
    | sort flowCount desc
    | limit 20' \
  --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q4=""

sleep 10

echo ""
echo "=== REJECTED TRAFFIC TOP PAIRS ==="
wait_for_query "$Q1" "rejected-traffic"

echo ""
echo "=== ACCEPTED TRAFFIC VOLUME BY PORT ==="
wait_for_query "$Q2" "accepted-volume"

echo ""
echo "=== RST FLAG PATTERNS ==="
wait_for_query "$Q3" "rst-flags"

echo ""
echo "=== LOW BYTES-PER-PACKET (POTENTIAL ZERO WINDOW STALLS) ==="
wait_for_query "$Q4" "zero-window"

if [ -n "${DNS_LOG_GROUP}" ]; then
  echo ""
  echo "=== ROUTE 53 RESOLVER QUERY LOGS: NXDOMAIN ANALYSIS ==="
  echo "DNS log group: $DNS_LOG_GROUP"
  DNS_Q=$(aws logs start-query \
    --log-group-name "$DNS_LOG_GROUP" \
    --start-time "$START_TIME" --end-time "$END_TIME" \
    --query-string 'fields @timestamp, query_name, rcode, srcids.instance, vpc_id
      | filter rcode = "NXDOMAIN" or rcode = "SERVFAIL"
      | stats count(*) as errorCount by query_name, rcode, srcids.instance
      | sort errorCount desc
      | limit 50' \
    --region "$REGION" --query 'queryId' --output text 2>/dev/null) || DNS_Q=""
  sleep 10
  wait_for_query "$DNS_Q" "dns-nxdomain"
else
  echo ""
  echo "=== ROUTE 53 RESOLVER QUERY LOGS: NOT CONFIGURED ==="
  echo "To enable DNS query logging, pass the log group as the second argument:"
  echo "  ./diag-flow-logs.sh /aws/vpc/flowlogs /aws/route53resolver/query-logs 30"
  echo ""
  echo "Enable Route 53 Resolver query logging via:"
  echo "  aws route53resolver create-resolver-query-log-config \\"
  echo "    --name prod-dns-logs \\"
  echo "    --destination-arn arn:aws:logs:REGION:ACCOUNT:log-group:/aws/route53resolver/query-logs \\"
  echo "    --region $REGION"
fi

echo ""
echo "=== ROUTE 53 RESOLVER CLOUDWATCH METRICS: NXDOMAIN AND SERVFAIL RATES ==="
START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(minutes=${MINUTES_BACK})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

ENDPOINT_IDS=$(aws route53resolver list-resolver-endpoints \
  --region "$REGION" \
  --query 'ResolverEndpoints[].Id' \
  --output text 2>/dev/null || echo "")

if [ -n "$ENDPOINT_IDS" ]; then
  for EP_ID in $ENDPOINT_IDS; do
    echo "--- Resolver endpoint: $EP_ID ---"
    for METRIC in NxDomainQueries ServFailQueries TimeoutQueries P90ResponseTime; do
      aws cloudwatch get-metric-statistics \
        --namespace AWS/Route53Resolver \
        --metric-name "$METRIC" \
        --dimensions Name=EndpointId,Value="$EP_ID" \
        --start-time "$START" --end-time "$END" \
        --period 300 --statistics Sum Average \
        --region "$REGION" \
        --query 'sort_by(Datapoints,&Timestamp)[*].[Timestamp,Sum,Average]' \
        --output text 2>/dev/null | \
        awk -v m="$METRIC" '{printf "  %s: %s (sum=%s avg=%s)\n", m, $1, $2, $3}' | tail -12
    done
  done
else
  echo "No Route 53 Resolver endpoints found in $REGION"
fi
EOF
chmod +x ./diag-flow-logs.sh
cat > ./prompt-flow-logs.sh << 'EOF'
#!/bin/bash
set -euo pipefail
FLOW_LOG_GROUP="${1:-/aws/vpc/flowlogs}"
DNS_LOG_GROUP="${2:-}"
./diag-flow-logs.sh "$FLOW_LOG_GROUP" "$DNS_LOG_GROUP" 30 | ./bedrock-ask.sh \
  "Analyse these VPC flow log and DNS resolver metrics for a live production incident.

For the flow log data: identify high volumes of REJECT actions on ports that services depend on (5432 for PostgreSQL, 3306 for MySQL, 6379 for Redis, 9092 for Kafka, 443 and 80 for HTTP services), which indicate security group or NACL blocking. Identify RST flag concentrations between specific source and destination pairs, which indicate the receiver is terminating connections abruptly and may be overloaded or misconfigured. Pay particular attention to the low bytes-per-packet query: flows on database or API ports where the average is below 100 bytes per packet indicate that the sender is transmitting tiny segments and waiting for acknowledgement, which is the characteristic signature of TCP zero-window stalls. In a zero-window stall, the receiver has advertised a receive window of 0 because its application layer cannot drain the buffer fast enough, usually because the application is blocked on a slow downstream call such as a database query or lock wait. The sender then sends periodic zero-window probes and waits. From the network layer this looks like very low throughput despite an established connection, and from the application layer it looks like a slow or hung request with no obvious error.

For the DNS data: identify NXDOMAIN spikes on specific query names that correlate with the incident timeline, which are direct evidence of DNS misconfiguration. A high NXDOMAIN rate on internal service names (anything ending in .svc.cluster.local, .internal, or a private domain) means CoreDNS is not resolving those names and the application is failing at service discovery. A high NXDOMAIN rate on public names from internal resources means the VPC resolver cannot reach upstream DNS, which is a network connectivity problem. SERVFAIL responses indicate the resolver encountered an error upstream. Rising P90ResponseTime on resolver endpoints combined with NxDomainQueries indicates the upstream forwarder is unreachable and CoreDNS is timing out before returning SERVFAIL."
EOF
chmod +x ./prompt-flow-logs.sh

4.3 Load Balancer Health

cat > ./diag-nlb.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

echo "=== ALL LOAD BALANCERS ==="
aws elbv2 describe-load-balancers \
  --region "$REGION" \
  --query 'LoadBalancers[].{Name:LoadBalancerName,Type:Type,State:State.Code,DNS:DNSName,AZ:AvailabilityZones}' \
  --output json

echo ""
echo "=== TARGET GROUP HEALTH ==="
TARGET_GROUPS=$(aws elbv2 describe-target-groups \
  --region "$REGION" \
  --query 'TargetGroups[].TargetGroupArn' \
  --output text)

for TG_ARN in $TARGET_GROUPS; do
  TG_NAME=$(aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetGroups[0].TargetGroupName' \
    --output text)

  echo ""
  echo "--- Target Group: $TG_NAME ---"
  aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --region "$REGION" \
    --output json
done

echo ""
echo "=== NLB/ALB CLOUDWATCH METRICS (last 15 min) ==="
LB_ARNS=$(aws elbv2 describe-load-balancers \
  --region "$REGION" \
  --query 'LoadBalancers[].LoadBalancerArn' \
  --output text)

for LB_ARN in $LB_ARNS; do
  LB_NAME=$(basename "$LB_ARN")
  LB_SUFFIX=$(echo "$LB_ARN" | awk -F':loadbalancer/' '{print $2}')

  echo ""
  echo "--- Unhealthy Host Count: $LB_NAME ---"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name UnHealthyHostCount \
    --dimensions Name=LoadBalancer,Value="$LB_SUFFIX" \
    --start-time "${METRIC_START:-$(date -u -d "${ANALYSIS_HOURS:-24} hours ago" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-${ANALYSIS_HOURS:-24}H '+%Y-%m-%dT%H:%M:%SZ')}" \
    --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    --period 60 \
    --statistics Maximum \
    --region "$REGION" \
    --output json 2>/dev/null || \
  aws cloudwatch get-metric-statistics \
    --namespace AWS/NetworkELB \
    --metric-name UnHealthyHostCount \
    --dimensions Name=LoadBalancer,Value="$LB_SUFFIX" \
    --start-time "${METRIC_START:-$(date -u -d "${ANALYSIS_HOURS:-24} hours ago" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -v-${ANALYSIS_HOURS:-24}H '+%Y-%m-%dT%H:%M:%SZ')}" \
    --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    --period 60 \
    --statistics Maximum \
    --region "$REGION" \
    --output json 2>/dev/null || echo "Could not retrieve metrics for $LB_NAME"
done
EOF
chmod +x ./diag-nlb.sh
cat > ./prompt-nlb.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-nlb.sh | ./bedrock-ask.sh \
  "Analyse these load balancer health states and target group data for a production incident. Identify: target groups with unhealthy hosts and what proportion of capacity is degraded, load balancers in a non-active state, patterns in which availability zones have healthy versus unhealthy targets that might indicate an AZ failure, rising UnHealthyHostCount metrics that show a degradation trend, and any targets that have been deregistered or are draining unexpectedly."
EOF
chmod +x ./prompt-nlb.sh

5. Kubernetes and Container Diagnostics

For EKS environments, the diagnostic surface is wider because you are looking at both the AWS control plane and the Kubernetes data plane. You need kubectl access for pod-level data and the AWS CLI for cluster and node group state.

5.1 Node and Pod State

The following assumes kubectl is configured to point at your production cluster. If you are using EKS with IAM authentication, your readonly role should already map to a Kubernetes RBAC group through the aws-auth ConfigMap.

cat > ./setup-eks-kubeconfig.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
CLUSTER_NAME="${1:-}"
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

if [ -z "$CLUSTER_NAME" ]; then
  echo "Discovering EKS clusters..."
  CLUSTERS=$(aws eks list-clusters --region "$REGION" --query 'clusters[]' --output text)
  echo "Available clusters: $CLUSTERS"
  echo "Usage: $0 <cluster-name>"
  exit 1
fi

aws eks update-kubeconfig \
  --name "$CLUSTER_NAME" \
  --region "$REGION" \
  --alias "prod-diagnostics-${CLUSTER_NAME}"

echo "Kubeconfig updated. Context: prod-diagnostics-${CLUSTER_NAME}"
echo "Test with: kubectl --context=prod-diagnostics-${CLUSTER_NAME} get nodes"
EOF
chmod +x ./setup-eks-kubeconfig.sh
cat > ./diag-k8s-pods.sh << 'EOF'
#!/bin/bash
set -euo pipefail
CONTEXT="${K8S_CONTEXT:-}"
NAMESPACE="${1:-}"

K8S_FLAGS=""
if [ -n "$CONTEXT" ]; then
  K8S_FLAGS="--context=$CONTEXT"
fi

NS_FLAG=""
if [ -n "$NAMESPACE" ]; then
  NS_FLAG="-n $NAMESPACE"
else
  NS_FLAG="--all-namespaces"
fi

# Pre-check: verify cluster is reachable before running any kubectl commands.
# Without this, set -euo pipefail will abort on the first failed kubectl call
# and produce an empty evidence file with no indication of what went wrong.
if ! kubectl $K8S_FLAGS cluster-info --request-timeout=10s &>/dev/null; then
  echo "ERROR: Cannot reach Kubernetes API server." >&2
  echo "Check: kubeconfig is configured, context '$CONTEXT' exists, VPN/network is up," >&2
  echo "and the EKS cluster endpoint is reachable from this machine." >&2
  echo "Run: aws eks update-kubeconfig --name <cluster> --region <region>" >&2
  exit 1
fi
echo "Cluster connectivity: OK"

echo "=== NODE STATUS ==="
kubectl $K8S_FLAGS get nodes -o wide

echo ""
echo "=== NODE RESOURCE PRESSURE ==="
kubectl $K8S_FLAGS describe nodes | grep -A5 "Conditions:" | grep -E "(MemoryPressure|DiskPressure|PIDPressure|Ready|NotReady)"

echo ""
echo "=== NODE RESOURCE UTILISATION ==="
kubectl $K8S_FLAGS top nodes 2>/dev/null || echo "Metrics server not available"

echo ""
echo "=== PODS NOT RUNNING ==="
kubectl $K8S_FLAGS get pods $NS_FLAG --field-selector='status.phase!=Running' -o wide 2>/dev/null | head -100

echo ""
echo "=== PODS WITH HIGH RESTART COUNTS ==="
kubectl $K8S_FLAGS get pods $NS_FLAG -o json | python3 -c "
import json, sys
data = json.load(sys.stdin)
pods = data.get('items', [])
high_restart = []
for pod in pods:
  ns = pod['metadata']['namespace']
  name = pod['metadata']['name']
  containers = pod.get('status', {}).get('containerStatuses', [])
  for c in containers:
    restarts = c.get('restartCount', 0)
    if restarts > 3:
      state = c.get('state', {})
      last_state = c.get('lastState', {})
      high_restart.append({
        'namespace': ns,
        'pod': name,
        'container': c['name'],
        'restarts': restarts,
        'current_state': list(state.keys()),
        'last_termination': last_state.get('terminated', {}).get('reason', 'unknown')
      })
high_restart.sort(key=lambda x: x['restarts'], reverse=True)
for p in high_restart[:30]:
  print(json.dumps(p))
"

echo ""
echo "=== CRASHLOOPBACKOFF PODS ==="
kubectl $K8S_FLAGS get pods $NS_FLAG | grep -i "CrashLoopBackOff" || echo "No CrashLoopBackOff pods found"

echo ""
echo "=== IMAGEPULLBACKOFF PODS ==="
kubectl $K8S_FLAGS get pods $NS_FLAG | grep -i "ImagePullBackOff\|ErrImagePull" || echo "No ImagePullBackOff pods found"

echo ""
echo "=== OOMKILLED EVENTS (recent) ==="
kubectl $K8S_FLAGS get events $NS_FLAG --sort-by='.lastTimestamp' | grep -i "OOMKill\|oom\|killed" | tail -30 || echo "No OOMKill events found"

echo ""
echo "=== ALL WARNING EVENTS ==="
kubectl $K8S_FLAGS get events $NS_FLAG --field-selector='type=Warning' --sort-by='.lastTimestamp' | tail -50

echo ""
echo "=== POD CPU AND MEMORY USAGE ==="
kubectl $K8S_FLAGS top pods $NS_FLAG --sort-by=memory 2>/dev/null | head -30 || echo "Metrics server not available"
EOF
chmod +x ./diag-k8s-pods.sh
cat > ./prompt-k8s-pods.sh << 'EOF'
#!/bin/bash
set -euo pipefail
NAMESPACE="${1:-}"
./diag-k8s-pods.sh "$NAMESPACE" | ./bedrock-ask.sh \
  "Analyse this Kubernetes cluster state during a production incident. Identify: nodes under memory or disk pressure that might be evicting pods, pods in CrashLoopBackOff with their restart reasons suggesting what is failing, ImagePullBackOff pods that indicate a registry authentication failure or missing image tag, OOMKilled events indicating containers hitting memory limits, warning events that preceded or correlate with the incident, specific pods that are consuming anomalously high memory or CPU, and any deployment rollouts or replicaset changes visible in the events that might have introduced the failure. Pay special attention to restart patterns: a pod that restarts repeatedly with a specific exit code tells a different story than one restarting because of a probe failure."
EOF
chmod +x ./prompt-k8s-pods.sh

5.2 CoreDNS Health and Configuration

DNS failures in Kubernetes are subtle and frequently misdiagnosed as application failures. A CoreDNS pod that is running but overwhelmed, or a ConfigMap that has been incorrectly modified, can produce widespread intermittent failures that look like network timeouts or connection refused errors at the application layer. The script below collects pod health, resource consumption, the Corefile, endpoint registration, and recent error logs. Bedrock uses this to correlate DNS symptoms with structural causes.

cat > ./diag-coredns.sh << 'EOF'
#!/bin/bash
set -euo pipefail
CONTEXT="${K8S_CONTEXT:-}"

K8S_FLAGS=""
if [ -n "$CONTEXT" ]; then
  K8S_FLAGS="--context=$CONTEXT"
fi

if ! kubectl $K8S_FLAGS cluster-info --request-timeout=10s &>/dev/null; then
  echo "ERROR: Cannot reach Kubernetes API server. Check kubeconfig and network." >&2
  exit 1
fi

echo "=== COREDNS POD STATUS ==="
kubectl $K8S_FLAGS get pods -n kube-system -l k8s-app=kube-dns -o wide

echo ""
echo "=== COREDNS POD RESOURCE USAGE ==="
kubectl $K8S_FLAGS top pods -n kube-system -l k8s-app=kube-dns 2>/dev/null || echo "Metrics server unavailable"

echo ""
echo "=== COREDNS CONFIGMAP (full Corefile) ==="
kubectl $K8S_FLAGS get configmap coredns -n kube-system -o yaml

echo ""
echo "=== COREDNS LOGS (errors only) ==="
COREDNS_PODS=$(kubectl $K8S_FLAGS get pods -n kube-system -l k8s-app=kube-dns -o jsonpath='{.items[*].metadata.name}')

for POD in $COREDNS_PODS; do
  echo ""
  echo "--- CoreDNS pod: $POD ---"
  kubectl $K8S_FLAGS logs "$POD" -n kube-system --tail=200 2>/dev/null | \
    grep -iE "error|SERVFAIL|refused|timeout|panic|NXDOMAIN|i/o timeout|no route" | tail -50 || \
  kubectl $K8S_FLAGS logs "$POD" -n kube-system --tail=100 2>/dev/null
done

echo ""
echo "=== DNS RESOLUTION TESTS FROM CLUSTER ==="
TEST_POD="dns-test-$(date +%s)"
kubectl $K8S_FLAGS run "$TEST_POD" --image=busybox:1.28 --rm --restart=Never -it -- sh -c '
  echo "--- internal: kubernetes.default ---"
  nslookup kubernetes.default
  echo ""
  echo "--- external: amazonaws.com ---"
  nslookup amazonaws.com
  echo ""
  echo "--- timing external query ---"
  time nslookup s3.amazonaws.com
' 2>/dev/null || echo "Could not run DNS test pod"

echo ""
echo "=== COREDNS HPA STATUS ==="
kubectl $K8S_FLAGS get hpa -n kube-system 2>/dev/null || echo "No HPA in kube-system"

echo ""
echo "=== KUBE-DNS SERVICE AND ENDPOINTS ==="
kubectl $K8S_FLAGS get svc kube-dns -n kube-system -o yaml
kubectl $K8S_FLAGS get endpoints kube-dns -n kube-system -o yaml

echo ""
echo "=== COREDNS DEPLOYMENT REPLICA STATE ==="
kubectl $K8S_FLAGS get deployment coredns -n kube-system -o yaml | grep -A10 "replicas\|status"

echo ""
echo "=== NODE resolv.conf SAMPLES ==="
# Check whether node resolv.conf points to VPC resolver (.2 address) or something unexpected
for NODE in $(kubectl $K8S_FLAGS get nodes -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | head -3); do
  echo "--- Node: $NODE ---"
  kubectl $K8S_FLAGS debug node/"$NODE" -it --image=busybox:1.28 -- cat /etc/resolv.conf 2>/dev/null || \
    echo "Could not read resolv.conf on $NODE (requires debug node permissions)"
done
EOF
chmod +x ./diag-coredns.sh
cat > ./prompt-coredns.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-coredns.sh | ./bedrock-ask.sh \
  "Analyse this CoreDNS state during a Kubernetes production incident. DNS failures manifest in applications as connection timeouts, unknown host errors, or intermittent failures that look random. Look for: CoreDNS pods that are not running or are restarting; SERVFAIL or NXDOMAIN errors in the logs that indicate upstream resolver failures or misconfigured forwarders; high CPU or memory usage on CoreDNS pods suggesting they are overwhelmed by query volume; endpoints missing from the kube-dns service meaning CoreDNS is not registered as a backend; ConfigMap changes that might have broken forwarding rules; and the ratio of running CoreDNS replicas to cluster size since under-provisioned CoreDNS is a common cause of intermittent DNS failures at scale. Also examine timing differences between internal and external query resolution times, which can reveal whether external queries are being resolved correctly by the VPC resolver or are failing silently."
EOF
chmod +x ./prompt-coredns.sh

5.3 Heterogeneous DNS Resolution and the ndots Problem

There is a class of Kubernetes DNS failure that is particularly difficult to diagnose because it affects only some queries, affects only some pods intermittently, and produces errors in applications that look nothing like DNS problems. The root cause is heterogeneous DNS resolution paths, where some queries resolve correctly via CoreDNS to the VPC resolver and some queries take a different path, timing out or returning wrong answers.

The most common source of this asymmetry is the ndots:5 default in every pod’s /etc/resolv.conf. When a pod queries api.stripe.com, the kernel does not send that query directly to CoreDNS. Because api.stripe.com has fewer than five dots, it first appends each search domain in sequence: api.stripe.com.default.svc.cluster.local, then api.stripe.com.svc.cluster.local, then api.stripe.com.cluster.local, and only on the fourth attempt sends the bare name. On EKS with the default four search domains, every external API call from every pod generates five DNS queries instead of one. Each NXDOMAIN round-trip adds latency, and the volume of failed queries can saturate CoreDNS long before the cluster appears busy by any other metric.

A second and more dangerous failure mode occurs when the CoreDNS Corefile is misconfigured so that queries for internal AWS resources such as RDS endpoints, ElastiCache, or internal Route 53 private hosted zone records are forwarded to a public upstream resolver rather than to the VPC resolver at the .2 address of the VPC CIDR. Private hosted zone records are not visible from the public internet and return NXDOMAIN from any resolver outside the VPC. When this misconfiguration exists, internal service names resolve correctly from the EC2 node itself, correctly from the VPC resolver, but silently fail from inside pods, producing the appearance of a network connectivity problem rather than a DNS problem.

A third scenario involves the VPC DHCP options being changed after the cluster was created, which updates the node-level /etc/resolv.conf but does not automatically propagate the change to the CoreDNS Corefile. The Corefile contains a hardcoded forward . /etc/resolv.conf directive that reads the node’s resolver configuration at CoreDNS startup, not continuously, so a DHCP change that updates the VPC’s DNS domain or nameserver address takes effect on new CoreDNS pods but leaves running pods using the stale configuration until they are restarted.

cat > ./diag-dns-paths.sh << 'EOF'
#!/bin/bash
# diag-dns-paths.sh: Collect evidence for heterogeneous DNS resolution analysis.
# Checks ndots configuration, search domain amplification, Corefile forwarding rules,
# VPC resolver alignment, and private hosted zone reachability from inside pods.
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
CONTEXT="${K8S_CONTEXT:-}"

K8S_FLAGS=""
if [ -n "$CONTEXT" ]; then
  K8S_FLAGS="--context=$CONTEXT"
fi

echo "=== VPC DNS SETTINGS ==="
VPC_IDS=$(aws ec2 describe-vpcs --region "$REGION" \
  --query 'Vpcs[].VpcId' --output text)
for VPC in $VPC_IDS; do
  echo "--- VPC: $VPC ---"
  aws ec2 describe-vpc-attribute --vpc-id "$VPC" --attribute enableDnsSupport \
    --region "$REGION" --output json 2>/dev/null
  aws ec2 describe-vpc-attribute --vpc-id "$VPC" --attribute enableDnsHostnames \
    --region "$REGION" --output json 2>/dev/null
  echo "  DHCP Options:"
  DHCP_ID=$(aws ec2 describe-vpcs --vpc-ids "$VPC" --region "$REGION" \
    --query 'Vpcs[0].DhcpOptionsId' --output text)
  aws ec2 describe-dhcp-options --dhcp-options-ids "$DHCP_ID" \
    --region "$REGION" \
    --query 'DhcpOptions[0].DhcpConfigurations' \
    --output json 2>/dev/null
done

echo ""
echo "=== ROUTE 53 PRIVATE HOSTED ZONES AND VPC ASSOCIATIONS ==="
aws route53 list-hosted-zones \
  --query 'HostedZones[?Config.PrivateZone==`true`].{Name:Name,ID:Id,RecordCount:ResourceRecordSetCount}' \
  --output json

# Check which private zones are associated with which VPCs
PRIVATE_ZONE_IDS=$(aws route53 list-hosted-zones \
  --query 'HostedZones[?Config.PrivateZone==`true`].Id' \
  --output text)
for ZONE_ID in $PRIVATE_ZONE_IDS; do
  SHORT_ID=$(echo "$ZONE_ID" | sed 's|/hostedzone/||')
  echo ""
  echo "--- Zone $SHORT_ID VPC associations ---"
  aws route53 get-hosted-zone --id "$SHORT_ID" \
    --query 'VPCs' --output json 2>/dev/null || echo "Could not retrieve zone details"
done

echo ""
echo "=== ROUTE 53 RESOLVER RULES (inbound/outbound endpoints) ==="
aws route53resolver list-resolver-rules --region "$REGION" --output json 2>/dev/null || echo "No resolver rules or insufficient permissions"
aws route53resolver list-resolver-endpoints --region "$REGION" --output json 2>/dev/null || echo "No resolver endpoints"

echo ""
echo "=== COREDNS COREFILE (forwarding rules) ==="
if kubectl $K8S_FLAGS get configmap coredns -n kube-system -o yaml 2>/dev/null; then
  echo ""
else
  echo "Could not retrieve CoreDNS ConfigMap (kubectl not configured or cluster unreachable)"
fi

echo ""
echo "=== POD resolv.conf CONFIGURATION ==="
# Inspect a sample pod from each namespace to check ndots and search domains
for NS in $(kubectl $K8S_FLAGS get namespaces -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep -v "kube-\|cert-\|external-dns" | head -5); do
  SAMPLE_POD=$(kubectl $K8S_FLAGS get pods -n "$NS" -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")
  if [ -n "$SAMPLE_POD" ]; then
    echo "--- $NS/$SAMPLE_POD ---"
    kubectl $K8S_FLAGS exec "$SAMPLE_POD" -n "$NS" -- cat /etc/resolv.conf 2>/dev/null || echo "  (exec not available)"
  fi
done

echo ""
echo "=== DNS RESOLUTION PATH COMPARISON ==="
# Run a test pod that resolves the same name via different methods to expose path asymmetry
TEST_NS="default"
TEST_POD="dns-path-test-$(date +%s)"
kubectl $K8S_FLAGS run "$TEST_POD" -n "$TEST_NS" \
  --image=busybox:1.28 --rm --restart=Never -it -- sh -c '
  echo "--- ndots check: count dots in resolv.conf ---"
  grep ndots /etc/resolv.conf || echo "ndots not set (default is 5)"

  echo ""
  echo "--- search domains in resolv.conf ---"
  grep search /etc/resolv.conf

  echo ""
  echo "--- query count test: how many queries does a single external lookup generate? ---"
  echo "Timing nslookup s3.amazonaws.com (unqualified, subject to search expansion):"
  time nslookup s3.amazonaws.com
  echo ""
  echo "Timing nslookup s3.amazonaws.com. (trailing dot, bypasses search expansion):"
  time nslookup s3.amazonaws.com.

  echo ""
  echo "--- internal RDS endpoint resolution test ---"
  echo "If your RDS endpoint is internal, substitute it below and check if it resolves"
  nslookup rds.amazonaws.com || echo "rds.amazonaws.com failed (expected if private zone not configured)"

  echo ""
  echo "--- SRV record presence check for kube-dns ---"
  nslookup -type=SRV _dns._udp.kube-dns.kube-system.svc.cluster.local || echo "SRV lookup failed"
' 2>/dev/null || echo "Could not run DNS path test pod"

echo ""
echo "=== COREDNS METRICS (if prometheus endpoint accessible) ==="
COREDNS_PODS=$(kubectl $K8S_FLAGS get pods -n kube-system -l k8s-app=kube-dns \
  -o jsonpath='{.items[*].metadata.name}' 2>/dev/null)
for POD in $COREDNS_PODS; do
  echo "--- $POD metrics snapshot ---"
  kubectl $K8S_FLAGS exec "$POD" -n kube-system -- \
    wget -qO- http://localhost:9153/metrics 2>/dev/null | \
    grep -E "^coredns_dns_(requests_total|responses_total|forward_requests_total|forward_healthcheck_failures_total)" | \
    head -30 || echo "Metrics endpoint not reachable inside pod"
done
EOF
chmod +x ./diag-dns-paths.sh
cat > ./prompt-dns-paths.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-dns-paths.sh | ./bedrock-ask.sh \
  "Analyse this DNS path and resolution data from an EKS production environment. You are looking for heterogeneous DNS resolution: conditions where some queries resolve correctly and others fail or take different paths, producing intermittent or service-specific failures that are hard to attribute to DNS.

Specifically investigate: whether the VPC has enableDnsSupport and enableDnsHostnames enabled, since disabling either breaks all AWS private hosted zone resolution from within the VPC; whether private Route 53 hosted zones are associated with the correct VPC, since a zone that exists but is not associated with the pod's VPC returns NXDOMAIN silently; whether the CoreDNS Corefile forward directive points to a valid upstream, specifically whether it reads /etc/resolv.conf (which picks up the VPC resolver dynamically) or whether it hardcodes an IP address that may have changed after a DHCP options update; whether ndots:5 is configured in pods and whether the search domain list means external queries are generating 5 DNS lookups instead of 1 (the timing comparison between a bare domain and a trailing-dot domain reveals this directly); whether Route 53 Resolver outbound endpoints exist and whether their rules would intercept queries that should go to the VPC resolver; and whether any namespace's pod resolv.conf differs from what CoreDNS should be providing, which would indicate a dnsPolicy misconfiguration on those pods.

The most dangerous scenario to identify is when internal RDS, ElastiCache, or service mesh endpoints are being forwarded to a public upstream resolver rather than resolved via the VPC. Those names are only visible inside the VPC and will return NXDOMAIN from any public resolver, producing application connection failures that look like network problems rather than DNS problems. Confirm or rule out this scenario based on the Corefile forwarding configuration and the private hosted zone VPC association data."
EOF
chmod +x ./prompt-dns-paths.sh

6. Database Diagnostics

Database issues during a production incident fall into three families: availability and connectivity failures, performance failures from slow queries or lock contention, and structural failures from missing indexes or bad execution plans. The Performance Insights API gives you access to the wait event data that reveals which of these is happening without needing to connect directly to the database.

6.1 RDS Instance and Cluster State

cat > ./diag-rds-state.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

echo "=== RDS INSTANCE STATUS ==="
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[].{
    ID:DBInstanceIdentifier,
    Class:DBInstanceClass,
    Engine:Engine,
    EngineVersion:EngineVersion,
    Status:DBInstanceStatus,
    MultiAZ:MultiAZ,
    StorageGB:AllocatedStorage,
    IOPS:Iops,
    StorageType:StorageType,
    PubliclyAccessible:PubliclyAccessible,
    BackupRetention:BackupRetentionPeriod,
    DeletionProtection:DeletionProtection,
    PerformanceInsights:PerformanceInsightsEnabled,
    MonitoringInterval:MonitoringInterval,
    CACert:CACertificateIdentifier,
    LatestRestoreTime:LatestRestorableTime,
    MaintenanceWindow:PreferredMaintenanceWindow,
    BackupWindow:PreferredBackupWindow,
    PendingModified:PendingModifiedValues
  }' \
  --output json

echo ""
echo "=== AURORA CLUSTER STATUS ==="
aws rds describe-db-clusters \
  --region "$REGION" \
  --query 'DBClusters[].{
    ID:DBClusterIdentifier,
    Engine:Engine,
    Status:Status,
    Members:DBClusterMembers,
    ReaderEndpoint:ReaderEndpoint,
    WriterEndpoint:Endpoint,
    MultiAZ:MultiAZ,
    BackupRetention:BackupRetentionPeriod,
    ActivityStreamStatus:ActivityStreamStatus
  }' \
  --output json

echo ""
echo "=== RDS EVENTS (last 4 hours) ==="
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START_TIME=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).isoformat())")
aws rds describe-events \
  --start-time "$START_TIME" \
  --region "$REGION" \
  --output json

echo ""
echo "=== RDS CLOUDWATCH METRICS (last 30 min) ==="
INSTANCES=$(aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[].DBInstanceIdentifier' \
  --output text)

ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

for INSTANCE in $INSTANCES; do
  echo ""
  echo "--- Instance: $INSTANCE ---"
  for METRIC in CPUUtilization FreeableMemory DatabaseConnections ReadLatency WriteLatency ReadIOPS WriteIOPS FreeStorageSpace ReplicaLag; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name "$METRIC" \
      --dimensions Name=DBInstanceIdentifier,Value="$INSTANCE" \
      --start-time "$START" \
      --end-time "$END" \
      --period 300 \
      --statistics Average Maximum \
      --region "$REGION" \
      --query 'sort_by(Datapoints, &Timestamp)[-1].[Average,Maximum]' \
      --output text 2>/dev/null || echo "N/A N/A")
    echo "$METRIC: avg=$(echo $VALUE | awk '{print $1}') max=$(echo $VALUE | awk '{print $2}')"
  done
done
EOF
chmod +x ./diag-rds-state.sh
cat > ./prompt-rds-state.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-rds-state.sh | ./bedrock-ask.sh \
  "Analyse this RDS infrastructure state during a production incident. Identify: instances not in available status, recent failover events in the RDS event history, CPU or memory metrics that are saturated, connection counts approaching the max_connections parameter limit, high read or write latency values that would explain application slowdowns, storage running low, replica lag values on read replicas that might indicate the replica is falling behind, instances without Performance Insights enabled (critical missing diagnostic capability), instances that are publicly accessible, backup retention periods below 7 days, and pending modifications that might have triggered a restart. For Aurora clusters, examine whether the writer and reader endpoints have the expected number of healthy members."
EOF
chmod +x ./prompt-rds-state.sh

6.2 Performance Insights: Slow Queries and Wait Events

Performance Insights is the most powerful tool available without needing direct database access. It shows you top SQL statements by load, grouped by wait event, giving you a complete picture of what the database is spending time on.

cat > ./diag-rds-pi.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
INSTANCE_ID="${1:-}"

if [ -z "$INSTANCE_ID" ]; then
  echo "Usage: $0 <db-instance-id>"
  echo ""
  echo "Available instances:"
  aws rds describe-db-instances \
    --region "$REGION" \
    --query 'DBInstances[].DBInstanceIdentifier' \
    --output text | tr '\t' '\n'
  exit 1
fi

DBI_RESOURCE_ID=$(aws rds describe-db-instances \
  --db-instance-identifier "$INSTANCE_ID" \
  --region "$REGION" \
  --query 'DBInstances[0].DbiResourceId' \
  --output text)

echo "=== PERFORMANCE INSIGHTS: DB LOAD BY WAIT (last 30 min) ==="
echo "Instance: $INSTANCE_ID | Resource ID: $DBI_RESOURCE_ID"

ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

echo ""
echo "--- Top SQL by DB Load ---"
aws pi get-resource-metrics \
  --service-type RDS \
  --identifier "db:${DBI_RESOURCE_ID}" \
  --start-time "$START" \
  --end-time "$END" \
  --period-in-seconds 300 \
  --metric-queries '[
    {
      "Metric": "db.load.avg",
      "GroupBy": {
        "Group": "db.sql",
        "Dimensions": ["db.sql.statement"],
        "Limit": 10
      }
    }
  ]' \
  --region "$REGION" \
  --output json 2>/dev/null || echo "Performance Insights may not be enabled on this instance"

echo ""
echo "--- Top Wait Events ---"
aws pi get-resource-metrics \
  --service-type RDS \
  --identifier "db:${DBI_RESOURCE_ID}" \
  --start-time "$START" \
  --end-time "$END" \
  --period-in-seconds 300 \
  --metric-queries '[
    {
      "Metric": "db.load.avg",
      "GroupBy": {
        "Group": "db.wait_event",
        "Limit": 10
      }
    }
  ]' \
  --region "$REGION" \
  --output json 2>/dev/null || echo "Performance Insights unavailable"

echo ""
echo "--- Top SQL by Calls ---"
aws pi describe-dimension-keys \
  --service-type RDS \
  --identifier "db:${DBI_RESOURCE_ID}" \
  --start-time "$START" \
  --end-time "$END" \
  --metric "db.load.avg" \
  --group-by '{"Group":"db.sql","Limit":15}' \
  --region "$REGION" \
  --output json 2>/dev/null || echo "Performance Insights unavailable"
EOF
chmod +x ./diag-rds-pi.sh
cat > ./prompt-rds-pi.sh << 'EOF'
#!/bin/bash
set -euo pipefail
INSTANCE_ID="${1:-}"
if [ -z "$INSTANCE_ID" ]; then
  echo "Usage: $0 <db-instance-id>"
  exit 1
fi
./diag-rds-pi.sh "$INSTANCE_ID" | ./bedrock-ask.sh \
  "Analyse this Performance Insights data for a production database incident. The DB load metric represents average active sessions. Values above the number of vCPUs indicate saturation. Identify: wait events that dominate DB load since these reveal whether the bottleneck is IO, lock contention, CPU, or network, specific SQL statements with the highest average load contribution suggesting they need index or query plan attention, lock or latch wait events that indicate contention between concurrent queries, IO-related waits that might indicate storage saturation or missing indexes causing full table scans, and sudden spikes in load that correlate with the incident start time. A database with high io/table_lock_wait is almost certainly suffering from a query without an appropriate index."
EOF
chmod +x ./prompt-rds-pi.sh

6.3 Slow Query Logs and Execution Plans

When you have identified a suspect query from Performance Insights, the next step is to pull the slow query log and examine execution plans. The following script exports slow query log events from CloudWatch Logs if the instance is configured to export them.

cat > ./diag-rds-slow-queries.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
INSTANCE_ID="${1:-}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
MINUTES_BACK="${2:-$(( ANALYSIS_HOURS * 60 ))}"

if [ -z "$INSTANCE_ID" ]; then
  echo "Usage: $0 <db-instance-id> [minutes-back]"
  exit 1
fi

LOG_GROUP="/aws/rds/instance/${INSTANCE_ID}/slowquery"

echo "=== SLOW QUERY LOG: $INSTANCE_ID (last ${MINUTES_BACK} minutes) ==="
echo "Log group: $LOG_GROUP"
echo ""

START_TIME=$(python3 -c "import time; print(int((time.time() - ${MINUTES_BACK}*60) * 1000))")

QUERY_ID=$(aws logs start-query \
  --log-group-name "$LOG_GROUP" \
  --start-time "$START_TIME" \
  --end-time "$(python3 -c 'import time; print(int(time.time() * 1000))')" \
  --query-string 'fields @timestamp, @message
    | filter @message like /Query_time/
    | parse @message "# Query_time: * Lock_time: * Rows_sent: * Rows_examined: *" as query_time, lock_time, rows_sent, rows_examined
    | sort @timestamp desc
    | limit 50' \
  --region "$REGION" \
  --query 'queryId' \
  --output text 2>/dev/null) || {
    echo "Slow query log group not found. Check that slow_query_log is enabled and logs are exported to CloudWatch."
    echo ""
    echo "=== RDS LOG FILES AVAILABLE ==="
    aws rds describe-db-log-files \
      --db-instance-identifier "$INSTANCE_ID" \
      --region "$REGION" \
      --output json
    exit 0
  }

sleep 8

aws logs get-query-results \
  --query-id "$QUERY_ID" \
  --region "$REGION" \
  --output json

echo ""
echo "=== SLOW QUERY PATTERN SUMMARY ==="
QUERY_ID2=$(aws logs start-query \
  --log-group-name "$LOG_GROUP" \
  --start-time "$START_TIME" \
  --end-time "$(python3 -c 'import time; print(int(time.time() * 1000))')" \
  --query-string 'fields @message
    | filter @message like /SELECT|INSERT|UPDATE|DELETE/
    | parse @message "SET timestamp=*;\n*" as ts, sql
    | stats count(*) as occurrences by sql
    | sort occurrences desc
    | limit 20' \
  --region "$REGION" \
  --query 'queryId' \
  --output text 2>/dev/null || echo "none")

if [ "$QUERY_ID2" != "none" ]; then
  sleep 8
  aws logs get-query-results \
    --query-id "$QUERY_ID2" \
    --region "$REGION" \
    --output json
fi
EOF
chmod +x ./diag-rds-slow-queries.sh
cat > ./prompt-rds-slow-queries.sh << 'EOF'
#!/bin/bash
set -euo pipefail
INSTANCE_ID="${1:-}"
if [ -z "$INSTANCE_ID" ]; then
  echo "Usage: $0 <db-instance-id>"
  exit 1
fi
./diag-rds-slow-queries.sh "$INSTANCE_ID" 60 | ./bedrock-ask.sh \
  "Analyse these slow query logs from a production RDS instance. Look for: queries with very high Rows_examined relative to Rows_sent which is the signature of a full table scan or a severely under-selective index, queries that appear repeatedly suggesting they are being called at high frequency without a result cache, queries with high Lock_time indicating they are blocked waiting for row or table locks, SELECT statements on large tables without a WHERE clause or with WHERE clauses on unindexed columns, and any patterns in the timing of slow queries that correlate with the start of the production incident. High Rows_examined with low Rows_sent is almost always an index problem. Recommend specific index candidates based on the query structure where possible."
EOF
chmod +x ./prompt-rds-slow-queries.sh

6.4 Aurora PostgreSQL Query Plan Regression and QPM

There is a category of Aurora PostgreSQL incident that is particularly insidious because it appears gradually, worsens unpredictably, and can cause OOM events and cluster restarts that are traced to the wrong cause. The pattern is query plan regression: the PostgreSQL planner switches from an efficient plan to an inefficient one, usually in response to a statistics update, a schema change, a parameter group modification, or a minor Aurora engine version upgrade. The new plan may involve a sequential scan where an index scan previously existed, a hash join with a large in-memory hash table where a nested loop was previously used, or an unexpected sort spill to disk. Each of these consumes significantly more memory per connection, and on clusters with high connection counts the aggregate effect is rapid memory exhaustion followed by OOM events that look like instance sizing problems rather than query plan problems.

Aurora PostgreSQL includes the apg_plan_mgmt extension, which provides Query Plan Management (QPM). When enabled, QPM captures execution plans for qualifying SQL statements, allows you to approve specific plans, and enforces only approved plans regardless of what the planner would choose based on current statistics. The diag-aurora-qpm.sh script below extracts QPM state, identifies rejected and unapproved plans that are currently being bypassed, correlates plan change timestamps against the incident window, and pulls work_mem configuration which determines whether sort operations and hash joins spill to disk.

cat > ./diag-aurora-qpm.sh << 'EOF'
#!/bin/bash
# diag-aurora-qpm.sh: Collect Aurora PostgreSQL query plan state for Bedrock analysis.
# Requires that psql is available and that the DB_ENDPOINT, DB_USER, and DB_NAME
# variables are set, or that a .pgpass file exists for passwordless auth.
# Falls back to CloudWatch and RDS API data when direct DB access is not available.
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
INSTANCE_ID="${1:-}"
DB_ENDPOINT="${DB_ENDPOINT:-}"
DB_USER="${DB_USER:-postgres}"
DB_NAME="${DB_NAME:-postgres}"
DB_PORT="${DB_PORT:-5432}"

if [ -z "$INSTANCE_ID" ]; then
  echo "Usage: $0 <db-instance-id>"
  echo ""
  echo "Available Aurora PostgreSQL instances:"
  aws rds describe-db-instances \
    --region "$REGION" \
    --query 'DBInstances[?Engine==`aurora-postgresql`].DBInstanceIdentifier' \
    --output text | tr '\t' '\n'
  exit 1
fi

echo "=== AURORA INSTANCE CONFIGURATION ==="
aws rds describe-db-instances \
  --db-instance-identifier "$INSTANCE_ID" \
  --region "$REGION" \
  --query 'DBInstances[0].{
    ID:DBInstanceIdentifier,
    Class:DBInstanceClass,
    Engine:Engine,
    EngineVersion:EngineVersion,
    Status:DBInstanceStatus,
    ParameterGroup:DBParameterGroups[0].DBParameterGroupName,
    ClusterID:DBClusterIdentifier
  }' \
  --output json

echo ""
echo "=== PARAMETER GROUP: QPM AND MEMORY SETTINGS ==="
PARAM_GROUP=$(aws rds describe-db-instances \
  --db-instance-identifier "$INSTANCE_ID" \
  --region "$REGION" \
  --query 'DBInstances[0].DBParameterGroups[0].DBParameterGroupName' \
  --output text)

if [ -n "$PARAM_GROUP" ] && [ "$PARAM_GROUP" != "None" ]; then
  echo "Parameter group: $PARAM_GROUP"
  # QPM parameters
  for PARAM in rds.enable_plan_management apg_plan_mgmt.capture_plan_baselines \
                apg_plan_mgmt.use_plan_baselines apg_plan_mgmt.max_plans \
                apg_plan_mgmt.unapproved_plan_execution_threshold \
                work_mem maintenance_work_mem effective_cache_size \
                shared_buffers max_connections random_page_cost; do
    VALUE=$(aws rds describe-db-parameters \
      --db-parameter-group-name "$PARAM_GROUP" \
      --region "$REGION" \
      --query "Parameters[?ParameterName=='${PARAM}'].{Value:ParameterValue,Source:Source,ApplyMethod:ApplyMethod}" \
      --output json 2>/dev/null | python3 -c "import json,sys; d=json.load(sys.stdin); print(json.dumps(d[0]) if d else 'not set')" 2>/dev/null || echo "not found")
    echo "  $PARAM: $VALUE"
  done
fi

echo ""
echo "=== CLUSTER PARAMETER GROUP ==="
CLUSTER_ID=$(aws rds describe-db-instances \
  --db-instance-identifier "$INSTANCE_ID" \
  --region "$REGION" \
  --query 'DBInstances[0].DBClusterIdentifier' \
  --output text)

if [ -n "$CLUSTER_ID" ] && [ "$CLUSTER_ID" != "None" ]; then
  CLUSTER_PARAM_GROUP=$(aws rds describe-db-clusters \
    --db-cluster-identifier "$CLUSTER_ID" \
    --region "$REGION" \
    --query 'DBClusters[0].DBClusterParameterGroup' \
    --output text 2>/dev/null || echo "")
  if [ -n "$CLUSTER_PARAM_GROUP" ]; then
    echo "Cluster parameter group: $CLUSTER_PARAM_GROUP"
    for PARAM in rds.enable_plan_management apg_plan_mgmt.capture_plan_baselines \
                  apg_plan_mgmt.use_plan_baselines; do
      VALUE=$(aws rds describe-db-cluster-parameters \
        --db-cluster-parameter-group-name "$CLUSTER_PARAM_GROUP" \
        --region "$REGION" \
        --query "Parameters[?ParameterName=='${PARAM}'].ParameterValue" \
        --output text 2>/dev/null || echo "not found")
      echo "  $PARAM: $VALUE"
    done
  fi
fi

echo ""
echo "=== CLOUDWATCH: MEMORY AND CPU LAST 2 HOURS ==="
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

for METRIC in CPUUtilization FreeableMemory SwapUsage DatabaseConnections; do
  echo "--- $METRIC ---"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name "$METRIC" \
    --dimensions Name=DBInstanceIdentifier,Value="$INSTANCE_ID" \
    --start-time "$START" --end-time "$END" \
    --period 300 --statistics Average Maximum \
    --region "$REGION" \
    --query 'sort_by(Datapoints,&Timestamp)[*].[Timestamp,Average,Maximum]' \
    --output text 2>/dev/null | tail -24
done

echo ""
echo "=== RDS EVENTS: RESTARTS AND OOM (last 24 hours) ==="
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START_24=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).isoformat())")
aws rds describe-events \
  --source-identifier "$INSTANCE_ID" \
  --source-type db-instance \
  --start-time "$START_24" \
  --region "$REGION" \
  --query 'Events[?contains(Message, `restart`) || contains(Message, `recovery`) || contains(Message, `failover`) || contains(Message, `OOM`) || contains(Message, `memory`)].{Time:Date,Message:Message}' \
  --output json

echo ""
echo "=== DIRECT DB QUERY: QPM PLAN STATE ==="
if [ -n "$DB_ENDPOINT" ]; then
  echo "Connecting to $DB_ENDPOINT as $DB_USER..."

  psql -h "$DB_ENDPOINT" -U "$DB_USER" -d "$DB_NAME" -p "$DB_PORT" \
    --no-password -t -A -F'|' << 'SQLEOF' 2>/dev/null || echo "psql connection failed - set DB_ENDPOINT, DB_USER, DB_NAME and ensure .pgpass is configured"

\echo '--- QPM extension status ---'
SELECT extname, extversion FROM pg_extension WHERE extname = 'apg_plan_mgmt';

\echo '--- Plan baseline summary by status ---'
SELECT status, enabled, count(*) as plan_count
FROM apg_plan_mgmt.dba_plans
GROUP BY status, enabled
ORDER BY status, enabled;

\echo '--- Most recently changed plans (last 24 hours) ---'
SELECT sql_hash, plan_hash, status, enabled,
       last_used, first_used,
       total_plan_time_ms, calls,
       CASE WHEN calls > 0 THEN total_plan_time_ms / calls ELSE 0 END as avg_ms_per_call
FROM apg_plan_mgmt.dba_plans
WHERE last_used > now() - interval '24 hours'
   OR first_used > now() - interval '24 hours'
ORDER BY last_used DESC NULLS LAST
LIMIT 30;

\echo '--- Rejected or unapproved plans currently in use ---'
SELECT sql_hash, plan_hash, status, enabled, calls, total_plan_time_ms
FROM apg_plan_mgmt.dba_plans
WHERE (status = 'Rejected' OR status = 'Unapproved')
  AND calls > 0
ORDER BY calls DESC
LIMIT 20;

\echo '--- Current work_mem and sort method settings ---'
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('work_mem', 'maintenance_work_mem', 'effective_cache_size',
               'max_connections', 'shared_buffers', 'temp_buffers',
               'random_page_cost', 'enable_hashjoin', 'enable_seqscan',
               'enable_nestloop', 'enable_mergejoin');

\echo '--- Active queries with high memory or long runtime ---'
SELECT pid, now() - query_start as duration, state,
       wait_event_type, wait_event,
       left(query, 200) as query_snippet
FROM pg_stat_activity
WHERE state != 'idle'
  AND query_start < now() - interval '30 seconds'
ORDER BY duration DESC
LIMIT 20;

\echo '--- Temporary file usage (sign of work_mem spill) ---'
SELECT datname, temp_files, temp_bytes,
       blks_read, blks_hit,
       CASE WHEN blks_read + blks_hit > 0
            THEN round(100.0 * blks_hit / (blks_read + blks_hit), 2)
            ELSE 0 END as cache_hit_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY temp_bytes DESC;

\echo '--- Tables with stale statistics (bloated dead tuples) ---'
SELECT schemaname, tablename,
       n_live_tup, n_dead_tup,
       CASE WHEN n_live_tup > 0
            THEN round(100.0 * n_dead_tup / n_live_tup, 1)
            ELSE 0 END as dead_tup_pct,
       last_analyze, last_autoanalyze,
       last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
   OR (n_live_tup > 0 AND n_dead_tup::float / n_live_tup > 0.1)
ORDER BY n_dead_tup DESC
LIMIT 20;

SQLEOF
else
  echo "DB_ENDPOINT not set. Direct query section skipped."
  echo "To enable direct query analysis, set: export DB_ENDPOINT=<your-cluster-endpoint>"
  echo "and ensure .pgpass or PGPASSWORD is configured for passwordless auth."
  echo ""
  echo "Without direct DB access, the CloudWatch and RDS API data above is still"
  echo "sufficient for Bedrock to identify memory pressure patterns and plan instability."
fi
EOF
chmod +x ./diag-aurora-qpm.sh
cat > ./prompt-aurora-qpm.sh << 'EOF'
#!/bin/bash
set -euo pipefail
INSTANCE_ID="${1:-}"
if [ -z "$INSTANCE_ID" ]; then
  echo "Usage: $0 <db-instance-id>"
  exit 1
fi
./diag-aurora-qpm.sh "$INSTANCE_ID" | ./bedrock-ask.sh \
  "Analyse this Aurora PostgreSQL data for query plan regression and memory growth issues.

The primary failure pattern to investigate is query plan regression: a change in the execution plan chosen by the PostgreSQL planner that causes a previously fast query to become slow or memory-intensive. This often manifests as rising FreeableMemory decline (memory growing without release), increasing swap usage, rising CPUUtilization, and eventually OOM restarts or cluster failovers. The application observes slow queries and then database connection failures, which are often incorrectly attributed to instance sizing.

Examine the following in the data you have been given. From the parameter group: check whether rds.enable_plan_management is 1 and whether apg_plan_mgmt.use_plan_baselines is On, since if QPM is not enforcing approved plans then any statistics update or engine upgrade can silently change execution plans. Check whether work_mem is set above 4MB, since the default 4MB means any query with a sort or hash join on a table larger than 4MB will spill to disk, and on a cluster with many concurrent connections the aggregate temporary file I/O can saturate storage. A work_mem setting that is appropriate for the cluster's peak connection count is: available_memory divided by (max_connections multiplied by average_sort_operations_per_query); values above 64MB on clusters with more than 100 connections require careful monitoring.

From the QPM plan state (if available): look for plans with status Rejected or Unapproved that are still accumulating calls, which means QPM is configured but the enforcement is incomplete; look for plans whose first_used timestamp corresponds to the incident start time, which strongly suggests a plan change triggered the incident; look for high avg_ms_per_call values combined with high call counts, which reveals the queries contributing most to total DB load.

From the pg_stat_database data: high temp_bytes indicates sort or hash join operations are spilling to disk due to insufficient work_mem, and this is one of the most common causes of gradual memory exhaustion in Aurora PostgreSQL because temporary files consume buffer pool memory and storage IOPS simultaneously.

From the statistics health data: tables with high dead_tup_pct and stale last_analyze timestamps have unreliable row count estimates, which causes the planner to misestimate join cardinality and choose hash joins with unexpectedly large hash tables. An autovacuum that has fallen behind on a high-churn table is a common trigger for sudden plan regression.

Correlate the CloudWatch memory and CPU trends against the RDS event log. If FreeableMemory shows a declining trend that began before the incident alert fired, this is a memory growth pattern consistent with plan regression rather than a sudden load spike, and the root cause is almost certainly a plan change or statistics staleness rather than increased traffic volume."
EOF
chmod +x ./prompt-aurora-qpm.sh

7. S3 Diagnostics

S3 failures during a production incident usually fall into three categories: access denial from a changed bucket policy or IAM permission, throttling from an application making too many requests to a prefix without random prefix distribution, and data integrity issues from a versioning or lifecycle policy change that has removed expected objects.

cat > ./diag-s3.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
BUCKET_FILTER="${1:-}"

echo "=== ACCOUNT-LEVEL PUBLIC ACCESS BLOCK ==="
aws s3control get-public-access-block \
  --account-id "$(aws sts get-caller-identity --query Account --output text)" \
  --output json 2>/dev/null || echo "Account-level public access block not configured"

echo ""
echo "=== S3 BUCKET LIST ==="
aws s3api list-buckets \
  --query 'Buckets[].{Name:Name,Created:CreationDate}' \
  --output json

echo ""
echo "=== S3 CLOUDWATCH METRICS (last 30 min) ==="
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

BUCKETS=$(aws s3api list-buckets --query 'Buckets[].Name' --output text)

for BUCKET in $BUCKETS; do
  if [ -n "$BUCKET_FILTER" ] && [[ "$BUCKET" != *"$BUCKET_FILTER"* ]]; then
    continue
  fi

  echo ""
  echo "--- Bucket: $BUCKET ---"

  for METRIC in 4xxErrors 5xxErrors TotalRequestLatency AllRequests BytesDownloaded BytesUploaded; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/S3 \
      --metric-name "$METRIC" \
      --dimensions Name=BucketName,Value="$BUCKET" Name=FilterId,Value=EntireBucket \
      --start-time "$START" \
      --end-time "$END" \
      --period 300 \
      --statistics Sum Average \
      --region "$REGION" \
      --query 'sort_by(Datapoints, &Timestamp)[-1].[Sum,Average]' \
      --output text 2>/dev/null || echo "N/A N/A")
    echo "  $METRIC: sum=$(echo $VALUE | awk '{print $1}') avg=$(echo $VALUE | awk '{print $2}')"
  done

  echo "  PublicAccessBlock:"
  aws s3api get-public-access-block \
    --bucket "$BUCKET" \
    --query 'PublicAccessBlockConfiguration' \
    --output json 2>/dev/null || echo "    (not configured or access denied)"

  echo "  Versioning:"
  aws s3api get-bucket-versioning \
    --bucket "$BUCKET" \
    --output json 2>/dev/null || echo "    (access denied)"

  echo "  Lifecycle rules:"
  aws s3api get-bucket-lifecycle-configuration \
    --bucket "$BUCKET" \
    --query 'Rules[].{ID:ID,Status:Status,Expiration:Expiration,Transitions:Transitions}' \
    --output json 2>/dev/null || echo "    (none or access denied)"
done

echo ""
echo "=== S3 ACCESS POINTS ==="
aws s3control list-access-points \
  --account-id "$(aws sts get-caller-identity --query Account --output text)" \
  --output json 2>/dev/null || echo "No access points or insufficient permissions"
EOF
chmod +x ./diag-s3.sh
cat > ./prompt-s3.sh << 'EOF'
#!/bin/bash
set -euo pipefail
BUCKET_FILTER="${1:-}"
./diag-s3.sh "$BUCKET_FILTER" | ./bedrock-ask.sh \
  "Analyse this S3 infrastructure data during a production incident where applications may be failing to read or write objects. Identify: the account-level public access block status, buckets with high 4xx error rates which indicate access denial or invalid requests, buckets with elevated 5xx error rates which indicate S3 service-side throttling, buckets with public access block not configured, lifecycle rules with short expiration windows that might have recently deleted objects the application expects to find, versioning disabled on buckets where it should be enabled for durability, and latency patterns in TotalRequestLatency that might indicate a prefix hotspot. A bucket serving 4xx errors to an application that was working previously almost always means a bucket policy or IAM change occurred recently."
EOF
chmod +x ./prompt-s3.sh

8. Cache Diagnostics: ElastiCache and DAX

Cache failures are among the most dangerous production incidents because they are frequently invisible until they cascade. A cache that silently degrades has a different failure signature from one that is hard down: hit rates fall slowly, evictions climb, application latency increases, and database connection counts rise as cache misses drive through to RDS or DynamoDB. By the time the application is noticeably failing, the cache has usually been in distress for minutes or hours.

There are three distinct cache failure modes to check. The first is memory pressure, where the cache is running out of space and evicting keys. In a session cache or a read-through cache, evictions mean data the application expected to find is not there, and the request falls through to the database. If the eviction rate is high enough, the database receives every request as a miss and the cache provides no benefit at all. The second is replication lag, where replica nodes are serving stale data because they have fallen behind the primary, typically due to a heavy write workload or a slow network link. The third is cluster resharding, where ElastiCache is redistributing slots across shards in response to a scale-out or scale-in operation. Resharding is designed to be online and non-disruptive, but it is compute-intensive and increases write latency on slots that are currently migrating. Applications with tight timeout budgets will see errors during resharding if they were not designed for it.

For DAX specifically, the cache hit rate must stay above 90% for the cluster to provide any material benefit. Below that threshold, the volume of DynamoDB pass-through requests begins to consume cluster resources without reducing DynamoDB load, and the cluster can become a bottleneck rather than an accelerator.

One metric worth calling out explicitly: for ElastiCache Redis, EngineCPUUtilization is the metric to watch for CPU, not CPUUtilization. Redis is single-threaded for most operations, so CPUUtilization reflects the entire node across all cores and significantly underrepresents the load on the Redis process itself. A cluster showing 25% CPUUtilization may have its Redis engine thread at 90%.

cat > ./diag-cache.sh << 'EOF'
#!/bin/bash
# diag-cache.sh: Collect ElastiCache Redis/Valkey/Memcached and DAX diagnostics.
# Covers hit/miss rates, evictions, memory fragmentation, replication lag,
# cluster resharding state, connection counts, and slow log configuration.
# All evidence written to EVIDENCE_DIR before Bedrock analysis.
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
EVIDENCE_DIR="${EVIDENCE_DIR:-./evidence}"
mkdir -p "$EVIDENCE_DIR"

START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

echo "=== ELASTICACHE CLUSTERS ==="
aws elasticache describe-cache-clusters \
  --show-cache-node-info \
  --region "$REGION" \
  --query 'CacheClusters[*].{
    ID:CacheClusterId,
    Engine:Engine,
    EngineVersion:EngineVersion,
    NodeType:CacheNodeType,
    Status:CacheClusterStatus,
    NumNodes:NumCacheNodes,
    ReplicationGroupId:ReplicationGroupId,
    MultiAZ:AutoMinorVersionUpgrade,
    EncryptionAtRest:AtRestEncryptionEnabled,
    EncryptionInTransit:TransitEncryptionEnabled,
    SlowLogEnabled:LogDeliveryConfigurations
  }' \
  --output json | tee "$EVIDENCE_DIR/elasticache-clusters.json"

echo ""
echo "=== ELASTICACHE REPLICATION GROUPS (Redis cluster mode) ==="
aws elasticache describe-replication-groups \
  --region "$REGION" \
  --query 'ReplicationGroups[*].{
    ID:ReplicationGroupId,
    Description:Description,
    Status:Status,
    ClusterMode:ClusterEnabled,
    MultiAZ:MultiAZ,
    AutoFailover:AutomaticFailover,
    NodeGroups:NodeGroups[*].{ID:NodeGroupId,Status:Status,Slots:Slots,Members:NodeGroupMembers[*].{ID:CacheClusterId,Role:CurrentRole,Status:CurrentStatus}},
    AtRestEncryption:AtRestEncryptionEnabled,
    TransitEncryption:TransitEncryptionEnabled,
    DataTieringEnabled:DataTieringEnabled
  }' \
  --output json | tee "$EVIDENCE_DIR/elasticache-replication-groups.json"

echo ""
echo "=== ELASTICACHE EVENTS (last ${ANALYSIS_HOURS}h) ==="
aws elasticache describe-events \
  --region "$REGION" \
  --start-time "$START" \
  --query 'Events[*].{Time:Date,Source:SourceIdentifier,Message:Message}' \
  --output json | tee "$EVIDENCE_DIR/elasticache-events.json"

echo ""
echo "=== CLOUDWATCH: ELASTICACHE METRICS ==="
CLUSTER_IDS=$(aws elasticache describe-cache-clusters \
  --region "$REGION" \
  --query 'CacheClusters[].CacheClusterId' \
  --output text 2>/dev/null | tr '\t' '\n') || CLUSTER_IDS=""

if [ -z "$CLUSTER_IDS" ]; then
  echo "No ElastiCache clusters found in $REGION or API call failed. Skipping per-cluster metrics."
fi

for CLUSTER_ID in $CLUSTER_IDS; do
  echo ""
  echo "--- Cluster: $CLUSTER_ID ---"

  for METRIC in \
    CacheHits CacheMisses \
    CacheHitRate \
    Evictions \
    CurrConnections NewConnections \
    FreeableMemory \
    DatabaseMemoryUsagePercentage \
    BytesUsedForCache \
    MemoryFragmentationRatio \
    EngineCPUUtilization CPUUtilization \
    ReplicationLag \
    SaveInProgress \
    CurrItems \
    NetworkBytesIn NetworkBytesOut \
    SwapUsage; do

    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache \
      --metric-name "$METRIC" \
      --dimensions Name=CacheClusterId,Value="$CLUSTER_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Maximum Sum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-6:].[Timestamp,Average,Maximum,Sum]' \
      --output json 2>/dev/null | \
      python3 -c "
import json,sys
d=json.load(sys.stdin)
if not d: print('  no data')
else:
    last=d[-1]; print(f'  latest: avg={last[1]} max={last[2]} sum={last[3]} at {last[0]}')
    if len(d)>1:
        trend='rising' if (d[-1][1] or 0) > (d[0][1] or 0) else 'stable/falling'
        print(f'  trend over window: {trend}')
" 2>/dev/null || echo "  no data")
    echo "  $METRIC:"
    echo "$DATA"
  done

  # Compute hit rate explicitly if CacheHits and CacheMisses available
  echo "  --- Computed hit rate over window ---"
  python3 << PYEOF
import subprocess, json
def get_sum(metric, cluster):
    r = subprocess.run([
        'aws','cloudwatch','get-metric-statistics',
        '--namespace','AWS/ElastiCache',
        '--metric-name', metric,
        '--dimensions', f'Name=CacheClusterId,Value={cluster}',
        '--start-time','$START','--end-time','$END',
        '--period','${ANALYSIS_HOURS}h'.replace('h','').replace('\${ANALYSIS_HOURS}','$ANALYSIS_HOURS'),
        '--period', str(${ANALYSIS_HOURS} * 3600),
        '--statistics','Sum',
        '--region','$REGION',
        '--query','Datapoints[0].Sum',
        '--output','text'
    ], capture_output=True, text=True, env=__import__('os').environ)
    try: return float(r.stdout.strip())
    except: return None

hits = get_sum('CacheHits', '$CLUSTER_ID')
misses = get_sum('CacheMisses', '$CLUSTER_ID')
if hits is not None and misses is not None:
    total = hits + misses
    if total > 0:
        rate = hits / total * 100
        print(f'  Hit rate over ${ANALYSIS_HOURS}h: {rate:.1f}% (hits={hits:.0f} misses={misses:.0f})')
        if rate < 80:
            print(f'  WARNING: hit rate below 80% - cache is not effectively absorbing load')
        elif rate < 90:
            print(f'  CAUTION: hit rate below 90% - worth investigating key expiry and eviction policy')
    else:
        print('  No cache traffic in window')
PYEOF

done

echo ""
echo "=== CLOUDWATCH: REPLICATION LAG ACROSS REPLICAS ==="
RG_IDS=$(aws elasticache describe-replication-groups \
  --region "$REGION" \
  --query 'ReplicationGroups[].ReplicationGroupId' \
  --output text 2>/dev/null | tr '\t' '\n') || RG_IDS=""

if [ -z "$RG_IDS" ]; then
  echo "No replication groups found or API call failed."
fi

for RG_ID in $RG_IDS; do
  echo "--- Replication group: $RG_ID ---"
  REPLICA_CLUSTER_IDS=$(aws elasticache describe-replication-groups \
    --replication-group-id "$RG_ID" \
    --region "$REGION" \
    --query 'ReplicationGroups[0].NodeGroups[*].NodeGroupMembers[?CurrentRole==`replica`].CacheClusterId' \
    --output text | tr '\t' '\n')

  for REPLICA_ID in $REPLICA_CLUSTER_IDS; do
    LAG=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/ElastiCache \
      --metric-name ReplicationLag \
      --dimensions Name=CacheClusterId,Value="$REPLICA_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Maximum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum]' \
      --output text 2>/dev/null || echo "N/A N/A")
    echo "  $REPLICA_ID: avg_lag=$(echo $LAG | awk '{print $1}')s max_lag=$(echo $LAG | awk '{print $2}')s"
  done
done

echo ""
echo "=== RESHARDING STATUS ==="
for RG_ID in $RG_IDS; do
  STATUS=$(aws elasticache describe-replication-groups \
    --replication-group-id "$RG_ID" \
    --region "$REGION" \
    --query 'ReplicationGroups[0].Status' \
    --output text 2>/dev/null || echo "unknown")
  NODE_GROUPS=$(aws elasticache describe-replication-groups \
    --replication-group-id "$RG_ID" \
    --region "$REGION" \
    --query 'ReplicationGroups[0].NodeGroups[*].{ID:NodeGroupId,Slots:Slots,Status:Status}' \
    --output json 2>/dev/null)
  echo "  $RG_ID: status=$STATUS"
  echo "  Shard slot assignments:"
  echo "$NODE_GROUPS" | python3 -c "
import json,sys
groups=json.load(sys.stdin)
for g in groups:
    print(f'    shard {g[\"ID\"]}: slots={g[\"Slots\"]} status={g[\"Status\"]}')
" 2>/dev/null
  if [ "$STATUS" = "modifying" ]; then
    echo "  *** RESHARDING IN PROGRESS: this increases write latency on migrating slots"
    echo "  *** Applications with tight timeouts may see errors during slot migration"
  fi
done

echo ""
echo "=== SLOW LOG CONFIGURATION ==="
for CLUSTER_ID in $CLUSTER_IDS; do
  LOG_CONFIG=$(aws elasticache describe-cache-clusters \
    --cache-cluster-id "$CLUSTER_ID" \
    --region "$REGION" \
    --query 'CacheClusters[0].LogDeliveryConfigurations' \
    --output json 2>/dev/null) || LOG_CONFIG=""
  if [ "$LOG_CONFIG" = "[]" ] || [ -z "$LOG_CONFIG" ]; then
    echo "  $CLUSTER_ID: slow logs NOT configured"
    echo "    Enable via: AWS Console -> ElastiCache -> Cluster -> Logs -> Enable Slow Log"
    echo "    Without slow logs, high-latency Redis commands cannot be identified from CloudWatch alone"
  else
    echo "  $CLUSTER_ID: log delivery configured: $LOG_CONFIG"
  fi
done

echo ""
echo "=== DAX CLUSTERS ==="
aws dax describe-clusters \
  --region "$REGION" \
  --query 'Clusters[*].{
    Name:ClusterName,
    Status:Status,
    Nodes:Nodes[*].{ID:NodeId,Status:NodeStatus,AZ:AvailabilityZone},
    NodeType:NodeType,
    TotalNodes:TotalNodes,
    ActiveNodes:ActiveNodes,
    Description:Description,
    ClusterEndpoint:ClusterDiscoveryEndpoint
  }' \
  --output json 2>/dev/null | tee "$EVIDENCE_DIR/dax-clusters.json" \
  || echo "No DAX clusters or insufficient permissions"

echo ""
echo "=== CLOUDWATCH: DAX METRICS ==="
DAX_CLUSTER_NAMES=$(aws dax describe-clusters \
  --region "$REGION" \
  --query 'Clusters[].ClusterName' \
  --output text 2>/dev/null | tr '\t' '\n' || echo "")

for DAX_NAME in $DAX_CLUSTER_NAMES; do
  echo "--- DAX cluster: $DAX_NAME ---"
  for METRIC in \
    ItemCacheHits ItemCacheMisses \
    QueryCacheHits QueryCacheMisses \
    ScanCacheHits ScanCacheMisses \
    TotalRequestCount \
    ErrorRequestCount FaultRequestCount FailedRequestCount \
    ThrottledRequestCount \
    CPUUtilization \
    NetworkBytesIn NetworkBytesOut; do

    DATA=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/DAX \
      --metric-name "$METRIC" \
      --dimensions Name=ClusterName,Value="$DAX_NAME" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Sum Average \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Sum]' \
      --output text 2>/dev/null || echo "N/A")
    echo "  $METRIC: $DATA"
  done

  echo "  --- DAX computed item cache hit rate ---"
  python3 << PYEOF
import subprocess, json, os
def get_sum(metric, cluster):
    r = subprocess.run([
        'aws','cloudwatch','get-metric-statistics',
        '--namespace','AWS/DAX',
        '--metric-name', metric,
        '--dimensions', f'Name=ClusterName,Value={cluster}',
        '--start-time','$START','--end-time','$END',
        '--period', str(${ANALYSIS_HOURS} * 3600),
        '--statistics','Sum',
        '--region','$REGION',
        '--query','Datapoints[0].Sum',
        '--output','text'
    ], capture_output=True, text=True, env=os.environ)
    try: return float(r.stdout.strip())
    except: return None

for cache_type in [('Item','ItemCacheHits','ItemCacheMisses'), ('Query','QueryCacheHits','QueryCacheMisses')]:
    label, hits_metric, misses_metric = cache_type
    hits = get_sum(hits_metric, '$DAX_NAME')
    misses = get_sum(misses_metric, '$DAX_NAME')
    if hits is not None and misses is not None:
        total = hits + misses
        if total > 0:
            rate = hits / total * 100
            print(f'  {label} cache hit rate: {rate:.1f}% (hits={hits:.0f} misses={misses:.0f})')
            if rate < 90:
                print(f'  WARNING: DAX {label} hit rate below 90% - misses are consuming cluster resources without reducing DynamoDB load')
        else:
            print(f'  {label} cache: no traffic in window')
PYEOF

done
EOF
chmod +x ./diag-cache.sh
cat > ./prompt-cache.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-cache.sh | ./bedrock-ask.sh \
  "Analyse this ElastiCache and DAX cache diagnostic data for a production incident.

For ElastiCache Redis: identify the cache hit rate over the analysis window. A rate below 80% means the cache is not effectively absorbing load and most requests are falling through to the database; this will produce rising DatabaseConnections on RDS and increased DynamoDB read units simultaneously. A rate between 80% and 90% is a warning sign. Check whether a recent eviction spike caused the hit rate to drop: when the eviction policy removes keys to make room for new data, subsequent requests for those keys miss the cache and go to the database, which can trigger a cascade if the database tier is already under load.

Check MemoryFragmentationRatio for each cluster. A ratio above 1.5 indicates that the operating system has allocated significantly more memory to Redis than Redis is actually using for data, which means memory reclamation will be inefficient and FreeableMemory will appear lower than the actual data usage warrants. A ratio above 2.0 is severe and means activedefrag should be enabled. The fix is not to scale up the node: it is to enable active defragmentation.

Check EngineCPUUtilization, not CPUUtilization. Redis is single-threaded and EngineCPUUtilization reflects the load on the Redis thread specifically. A cluster showing low CPUUtilization but high EngineCPUUtilization is CPU-bound on the Redis process and will show latency under load even though the node appears otherwise healthy.

Check ReplicationLag on all replica nodes. Lag above 10 seconds means replicas are serving data that is more than 10 seconds stale. If the application reads from replicas, users may observe inconsistencies or miss recently written data. High replication lag under a write-heavy workload combined with high EngineCPUUtilization on the primary suggests the primary cannot keep up with replication while serving write commands.

Check resharding status. If any replication group has status modifying, a slot migration is in progress. Write latency will be elevated on migrating slots. Applications with timeouts under 500ms are at risk of errors during the migration window. Check whether the resharding event in the ElastiCache events log correlates with the incident start time.

For DAX: flag any cluster where the item or query cache hit rate is below 90%. Below this threshold the cluster is processing more pass-through DynamoDB requests than cache hits, which means it is adding latency (the DAX cluster is in the request path) without reducing DynamoDB load. Flag any ThrottledRequestCount above zero: DAX throttles when the request rate exceeds the node's capacity, and this will surface as latency spikes or errors in the application. Flag ErrorRequestCount and FaultRequestCount above zero, which indicate client errors and DAX internal errors respectively."
EOF
chmod +x ./prompt-cache.sh

9. Security and Compliance Sweep

Beyond the service-level checks above, a broad security sweep during an incident can reveal whether a configuration change introduced the failure or whether the incident has a security component. This is also valuable as a standalone health check.

cat > ./diag-security.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"

echo "=== CLOUDTRAIL STATUS ==="
aws cloudtrail describe-trails \
  --region "$REGION" \
  --query 'trailList[*].{Name:Name,S3Bucket:S3BucketName,MultiRegion:IsMultiRegionTrail,LogValidation:LogFileValidationEnabled}' \
  --output json

aws cloudtrail get-trail-status \
  --name default \
  --region "$REGION" \
  --output json 2>/dev/null || echo "Default trail not found"

echo ""
echo "=== ACM CERTIFICATE STATUS ==="
aws acm list-certificates \
  --region "$REGION" \
  --query 'CertificateSummaryList[*].{ARN:CertificateArn,Domain:DomainName,Status:Status}' \
  --output json

echo ""
echo "=== WAF COVERAGE ON ALBS ==="
aws wafv2 list-web-acls \
  --scope REGIONAL \
  --region "$REGION" \
  --query 'WebACLs[*].{Name:Name,ARN:ARN}' \
  --output json

echo ""
echo "=== SSM MANAGED INSTANCE COVERAGE ==="
aws ssm describe-instance-information \
  --region "$REGION" \
  --query 'InstanceInformationList[*].{ID:InstanceId,PingStatus:PingStatus,AgentVersion:AgentVersion,PlatformType:PlatformType}' \
  --output json

echo ""
echo "=== EC2 INSTANCES WITHOUT SSM ==="
ALL_INSTANCES=$(aws ec2 describe-instances \
  --region "$REGION" \
  --query 'Reservations[*].Instances[*].InstanceId' \
  --output text)
SSM_INSTANCES=$(aws ssm describe-instance-information \
  --region "$REGION" \
  --query 'InstanceInformationList[*].InstanceId' \
  --output text)

python3 -c "
import sys
all_ids = set(sys.argv[1].split())
ssm_ids = set(sys.argv[2].split())
unmanaged = all_ids - ssm_ids
print('Unmanaged instances (no SSM):')
for i in sorted(unmanaged):
    print(f'  {i}')
print(f'Total: {len(all_ids)} EC2, {len(ssm_ids)} SSM-managed, {len(unmanaged)} unmanaged')
" "$ALL_INSTANCES" "$SSM_INSTANCES" 2>/dev/null || echo "Could not compute SSM coverage"

echo ""
echo "=== LAMBDA FUNCTION RUNTIMES ==="
aws lambda list-functions \
  --region "$REGION" \
  --query 'Functions[*].{Name:FunctionName,Runtime:Runtime,Memory:MemorySize,Timeout:Timeout,LastModified:LastModified}' \
  --output json

echo ""
echo "=== CLOUDWATCH ALARM STATE ==="
aws cloudwatch describe-alarms \
  --state-value ALARM \
  --region "$REGION" \
  --query 'MetricAlarms[*].{Name:AlarmName,State:StateValue,Metric:MetricName,Namespace:Namespace,Reason:StateReason}' \
  --output json

TOTAL=$(aws cloudwatch describe-alarms \
  --region "$REGION" \
  --query 'MetricAlarms | length(@)' \
  --output text 2>/dev/null || echo "0")
echo "Total alarms configured: $TOTAL"
EOF
chmod +x ./diag-security.sh
cat > ./prompt-security.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-security.sh | ./bedrock-ask.sh \
  "Analyse this security and compliance data for a production AWS account. Identify: CloudTrail trails that are not logging or do not have log file validation enabled, ACM certificates that are expired or expiring within 30 days which would cause HTTPS failures, internet-facing ALBs not protected by a WAF, EC2 instances not registered with SSM which means they cannot receive automated patches, Lambda functions running on deprecated runtimes approaching end of support, CloudWatch alarms currently in ALARM state, namespaces or services with no CloudWatch alarms configured indicating blind spots in observability. Any ALARM state CloudWatch alarm should be treated as a potential contributor to the current incident."
EOF
chmod +x ./prompt-security.sh

10. OpenSearch Service Diagnostics

OpenSearch Service failures during a production incident manifest in ways that are easy to misattribute. Application errors reading as connection timeouts, search returning partial or stale results, indexing falling behind to the point where dashboards show data hours old, and write operations blocked entirely while the cluster appears to be running. The cluster status colours published through the CloudWatch ES namespace are the starting point, but they describe symptoms rather than causes. A red cluster means at least one primary shard is unassigned and some data is unavailable. A yellow cluster means all primary shards are allocated but one or more replica shards are unassigned, which is operationally safe but leaves no redundancy for the next node failure. The causes of these states range across node failures, JVM heap exhaustion, storage pressure, shard imbalance, and index lifecycle misconfigurations.

The JVM heap limit is particularly important: when JVMMemoryPressure exceeds 92% for 30 minutes, OpenSearch Service activates a write-block protection mechanism. All write operations then fail with ClusterBlockException until heap pressure drops below 88% and holds there for five minutes. Applications experience this as a sudden transition from normal operation to total write failure with no gradual degradation, because the cluster was absorbing all writes normally up to the moment the threshold was crossed.

cat > ./diag-opensearch.sh << 'EOF'
#!/bin/bash
# diag-opensearch.sh: Collect OpenSearch Service domain health, cluster metrics,
# shard allocation state, and slow log configuration for Bedrock analysis.
# All evidence is written to local files before being printed for piping.
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
EVIDENCE_DIR="${EVIDENCE_DIR:-./evidence}"
mkdir -p "$EVIDENCE_DIR"

START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

echo "=== OPENSEARCH DOMAINS ==="
aws opensearch list-domain-names \
  --region "$REGION" \
  --output json 2>/dev/null | tee "$EVIDENCE_DIR/opensearch-domains.json" \
  || { echo "OpenSearch not in use or insufficient permissions"; exit 0; }

DOMAIN_NAMES=$(aws opensearch list-domain-names \
  --region "$REGION" \
  --query 'DomainNames[].DomainName' \
  --output text 2>/dev/null || echo "")

if [ -z "$DOMAIN_NAMES" ]; then
  echo "No OpenSearch domains found in $REGION"
  exit 0
fi

for DOMAIN in $DOMAIN_NAMES; do
  echo ""
  echo "=== DOMAIN: $DOMAIN ==="

  echo "--- Configuration ---"
  aws opensearch describe-domain \
    --domain-name "$DOMAIN" \
    --region "$REGION" \
    --query 'DomainStatus.{
      ARN:ARN,
      EngineVersion:EngineVersion,
      InstanceType:ClusterConfig.InstanceType,
      InstanceCount:ClusterConfig.InstanceCount,
      DedicatedMaster:ClusterConfig.DedicatedMasterEnabled,
      MasterType:ClusterConfig.DedicatedMasterType,
      MasterCount:ClusterConfig.DedicatedMasterCount,
      MultiAZ:ClusterConfig.ZoneAwarenessEnabled,
      WarmEnabled:ClusterConfig.WarmEnabled,
      StorageType:EBSOptions.VolumeType,
      StorageGB:EBSOptions.VolumeSize,
      StorageIOPS:EBSOptions.Iops,
      Endpoint:Endpoint,
      Processing:Processing,
      UpgradeProcessing:UpgradeProcessing,
      AccessPolicies:AccessPolicies,
      EncryptionAtRest:EncryptionAtRestOptions.Enabled,
      NodeToNode:NodeToNodeEncryptionOptions.Enabled,
      LogPublishing:LogPublishingOptions
    }' \
    --output json 2>/dev/null | tee "$EVIDENCE_DIR/opensearch-domain-${DOMAIN}.json"

  echo ""
  echo "--- CloudWatch Health Metrics (last ${ANALYSIS_HOURS}h) ---"
  for METRIC in ClusterStatus.red ClusterStatus.yellow ClusterStatus.green \
                ClusterIndexWritesBlocked Nodes FreeStorageSpace \
                JVMMemoryPressure MasterJVMMemoryPressure \
                CPUUtilization MasterCPUUtilization \
                SearchLatency IndexingLatency \
                SearchRate IndexingRate \
                AutomatedSnapshotFailure \
                SysMemoryUtilization; do
    SAFE_METRIC=$(echo "$METRIC" | tr '.' '_')
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/ES \
      --metric-name "$METRIC" \
      --dimensions Name=DomainName,Value="$DOMAIN" Name=ClientId,Value="$(aws sts get-caller-identity --query Account --output text)" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Maximum Minimum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-6:].[Timestamp,Average,Maximum,Minimum]' \
      --output json 2>/dev/null | tee "$EVIDENCE_DIR/opensearch-${DOMAIN}-${SAFE_METRIC}.json")
    LAST=$(echo "$VALUE" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d[-1] if d else 'no data')" 2>/dev/null || echo "no data")
    echo "  $METRIC: $LAST"
  done

  echo ""
  echo "--- CloudWatch Alarm State for this Domain ---"
  aws cloudwatch describe-alarms \
    --region "$REGION" \
    --query "MetricAlarms[?contains(Dimensions[?Name=='DomainName'].Value[], '$DOMAIN')].{Name:AlarmName,State:StateValue,Metric:MetricName,Reason:StateReason}" \
    --output json 2>/dev/null | tee "$EVIDENCE_DIR/opensearch-alarms-${DOMAIN}.json"

  echo ""
  echo "--- Slow Log Configuration ---"
  aws opensearch describe-domain \
    --domain-name "$DOMAIN" \
    --region "$REGION" \
    --query 'DomainStatus.LogPublishingOptions' \
    --output json 2>/dev/null

  echo ""
  echo "--- Recent CloudTrail Events for this Domain ---"
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=ResourceName,AttributeValue="$DOMAIN" \
    --start-time "$START" \
    --region "$REGION" \
    --max-results 20 \
    --query 'Events[*].{Time:EventTime,Event:EventName,User:Username,Source:EventSource}' \
    --output json 2>/dev/null | tee "$EVIDENCE_DIR/opensearch-cloudtrail-${DOMAIN}.json"

  echo ""
  echo "--- Direct Cluster Health (if endpoint is accessible) ---"
  ENDPOINT=$(aws opensearch describe-domain \
    --domain-name "$DOMAIN" \
    --region "$REGION" \
    --query 'DomainStatus.Endpoint' \
    --output text 2>/dev/null || echo "")

  if [ -n "$ENDPOINT" ] && [ "$ENDPOINT" != "None" ]; then
    echo "Attempting cluster health API call to https://${ENDPOINT}..."
    # --fail causes curl to exit non-zero on HTTP 4xx/5xx, preventing error JSON
    # being silently treated as valid health data by Bedrock
    CLUSTER_HEALTH=$(curl -s --fail --max-time 10 \
      -H "Content-Type: application/json" \
      "https://${ENDPOINT}/_cluster/health?pretty" 2>/dev/null) \
      || CLUSTER_HEALTH='{"status":"API_UNREACHABLE","reason":"curl failed - VPC endpoint, auth required, or cluster unavailable"}'
    echo "$CLUSTER_HEALTH" | tee "$EVIDENCE_DIR/opensearch-cluster-health-${DOMAIN}.json"

    echo ""
    echo "Attempting shard allocation explanation..."
    curl -s --fail --max-time 10 \
      -H "Content-Type: application/json" \
      "https://${ENDPOINT}/_cluster/allocation/explain?pretty" 2>/dev/null \
      | tee "$EVIDENCE_DIR/opensearch-shard-explain-${DOMAIN}.json" \
      || echo '{"status":"API_UNREACHABLE","reason":"auth required or cluster healthy (no unassigned shards)"}' \
      | tee "$EVIDENCE_DIR/opensearch-shard-explain-${DOMAIN}.json"

    echo ""
    echo "Attempting pending tasks..."
    curl -s --fail --max-time 10 \
      "https://${ENDPOINT}/_cluster/pending_tasks?pretty" 2>/dev/null \
      | tee "$EVIDENCE_DIR/opensearch-pending-tasks-${DOMAIN}.json" \
      || echo '{"status":"API_UNREACHABLE"}' \
      | tee "$EVIDENCE_DIR/opensearch-pending-tasks-${DOMAIN}.json"

    echo ""
    echo "Attempting index-level shard counts..."
    curl -s --fail --max-time 10 \
      "https://${ENDPOINT}/_cat/indices?v&h=index,health,status,pri,rep,docs.count,store.size,pri.store.size&s=health:desc" 2>/dev/null \
      | tee "$EVIDENCE_DIR/opensearch-indices-${DOMAIN}.txt" \
      || echo "API_UNREACHABLE: index listing requires auth or VPC access" \
      | tee "$EVIDENCE_DIR/opensearch-indices-${DOMAIN}.txt"

    echo ""
    echo "Attempting node stats (JVM, heap, CPU)..."
    curl -s --fail --max-time 15 \
      "https://${ENDPOINT}/_nodes/stats/jvm,os,process?pretty" 2>/dev/null \
      | python3 -c "
import json, sys
try:
    d = json.load(sys.stdin)
    nodes = d.get('nodes', {})
    for nid, n in nodes.items():
        name = n.get('name', nid[:8])
        jvm = n.get('jvm', {}).get('mem', {})
        heap_used = jvm.get('heap_used_in_bytes', 0)
        heap_max = jvm.get('heap_max_in_bytes', 1)
        heap_pct = round(100 * heap_used / heap_max, 1) if heap_max else 0
        cpu = n.get('os', {}).get('cpu', {}).get('percent', 'N/A')
        gc_count = sum(gc.get('collection_count', 0) for gc in n.get('jvm', {}).get('gc', {}).get('collectors', {}).values())
        print(f'  Node {name}: heap={heap_pct}% cpu={cpu}% gc_collections={gc_count}')
except Exception as e:
    print(f'  node stats parse error: {e}')
" 2>/dev/null || echo "Node stats unreachable (auth required or VPC access needed)"

  else
    echo "Domain endpoint not publicly accessible. CloudWatch metrics above are the primary evidence source."
    echo "For direct API access, ensure the VPC security group allows inbound 443 from the diagnostic host,"
    echo "or run this script from within the same VPC."
  fi

  echo ""
  echo "--- Slow Logs (if configured) ---"
  SLOW_LOG_GROUP=$(aws opensearch describe-domain \
    --domain-name "$DOMAIN" \
    --region "$REGION" \
    --query 'DomainStatus.LogPublishingOptions.SEARCH_SLOW_LOGS.CloudWatchLogsLogGroupArn' \
    --output text 2>/dev/null | sed 's|.*log-group:||' | sed 's|:.*||') || SLOW_LOG_GROUP=""

  if [ -n "$SLOW_LOG_GROUP" ] && [ "$SLOW_LOG_GROUP" != "None" ]; then
    echo "Querying slow search logs from: $SLOW_LOG_GROUP"
    START_MS=$(python3 -c "import time; print(int((time.time() - ${ANALYSIS_HOURS}*3600) * 1000))")
    END_MS=$(python3 -c "import time; print(int(time.time() * 1000))")
    Q_ID=$(aws logs start-query \
      --log-group-name "$SLOW_LOG_GROUP" \
      --start-time "$START_MS" --end-time "$END_MS" \
      --query-string 'fields @timestamp, @message
        | filter @message like /took\[/
        | parse @message "took[*]" as took_ms
        | sort @timestamp desc
        | limit 30' \
      --region "$REGION" --query 'queryId' --output text 2>/dev/null) || Q_ID=""
    if [ -n "$Q_ID" ]; then
      sleep 8
      aws logs get-query-results --query-id "$Q_ID" --region "$REGION" --output json 2>/dev/null \
        | tee "$EVIDENCE_DIR/opensearch-slow-logs-${DOMAIN}.json" \
        || echo '{"results":[],"status":"GET_RESULTS_FAILED"}' \
        | tee "$EVIDENCE_DIR/opensearch-slow-logs-${DOMAIN}.json"
    else
      echo "[WARN] Could not start slow log query for $DOMAIN - log group may not exist yet or permissions are missing" >&2
    fi
  else
    echo "Search slow logs not configured for this domain."
    echo "Enable via: aws opensearch update-domain-config --domain-name $DOMAIN \\"
    echo "  --log-publishing-options 'SEARCH_SLOW_LOGS={CloudWatchLogsLogGroupArn=arn:aws:logs:REGION:ACCOUNT:log-group:/aws/opensearch/$DOMAIN/search-slow,Enabled=true}'"
  fi

done
EOF
chmod +x ./diag-opensearch.sh
cat > ./prompt-opensearch.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-opensearch.sh | ./bedrock-ask.sh \
  "Analyse this Amazon OpenSearch Service data for a production incident.

Cluster status: a red status means at least one primary shard is unassigned and data is partially unavailable. Yellow means all primary shards are allocated but replica shards are missing, leaving no redundancy for further node failures. Green means fully healthy. Identify the current status and whether it has been in a degraded state for the full analysis window or whether it transitioned recently, as the transition time often correlates with a configuration change visible in the CloudTrail events.

JVM heap: when JVMMemoryPressure exceeds 92% for 30 minutes, OpenSearch activates write-block protection. ClusterIndexWritesBlocked going to 1 is direct evidence this has happened. Applications see this as all write operations failing with ClusterBlockException while reads continue. The root causes of sustained high heap are field data from high-cardinality aggregation queries (memory is not released until the segment is evicted), a large number of shards (each shard has a fixed JVM overhead regardless of size), open scroll contexts, or insufficient instance memory for the data volume. If node stats are available, look at heap percentages per node, as an imbalanced heap across nodes indicates shard distribution is not even.

Storage: FreeStorageSpace below 25% of total domain storage triggers shard allocation failure. OpenSearch will not allocate new shards to a node with less than 5% free space. Flag any node approaching this threshold.

Performance: SearchLatency above 500ms on a production cluster handling user-facing queries indicates either a resource constraint or inefficient queries. IndexingLatency above 30ms suggests indexing is falling behind, which is often caused by merge pressure from too many small segments. SearchRate and IndexingRate dropping suddenly without a corresponding drop in traffic indicates the cluster is throttling.

Slow logs: if available, identify the top queries by execution time. A slow search query often indicates a missing index on a filtered field or a high-cardinality aggregation without a filter to reduce the document set.

Configuration: flag domains without zone awareness enabled (no multi-AZ redundancy), without dedicated master nodes (master instability under load), without encryption at rest or node-to-node encryption (security posture), and without slow log publishing configured (makes this query impossible without the direct API connection)."
EOF
chmod +x ./prompt-opensearch.sh

11. Storage and IOPS Diagnostics

Storage throttling is one of the most reliably misdiagnosed production failure modes. The symptoms look like application slowness, database timeouts, or pod OOM events. The underlying cause is that EBS IOPS or throughput limits are being exceeded at either the volume level or the instance level, and the I/O queue is backing up. There are two independent bottleneck layers and both must be checked, because provisioning more IOPS on a volume has no effect if the instance-level EBS bandwidth cap is the constraint.

At the volume level, gp2 volumes allocate 3 IOPS per GiB of size up to a burst ceiling of 3,000 IOPS. A 250 GB gp2 volume has a sustained baseline of only 750 IOPS. Under sustained load, once the burst credit pool exhausts, the volume drops to that baseline and every I/O operation above it queues. The application experiences this as exponentially increasing latency with no error, because the operations eventually complete. gp3 volumes decouple IOPS from size and provide a flat 3,000 IOPS baseline regardless of size, but if you provision additional IOPS on gp3 and the instance type cannot deliver the bandwidth, those provisioned IOPS are wasted.

At the instance level, each instance type has a documented EBS-optimized bandwidth ceiling that applies to aggregate I/O across all attached volumes. An m5.large sustains 4,750 Mbps baseline EBS bandwidth and can burst to 10,000 Mbps for up to 30 minutes before reverting. If your workload requires sustained performance above the baseline, you are operating in borrowed time on every instance restart, and the performance cliff hits at an unpredictable moment after the burst window closes.

cat > ./diag-storage-iops.sh << 'EOF'
#!/bin/bash
# diag-storage-iops.sh: Collect EC2 and EBS IOPS, throughput, burst credit, and queue depth
# data to identify storage bottlenecks. Compares provisioned volume IOPS against instance-level
# EBS bandwidth caps to surface mismatches that cause throttling under sustained load.
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
INSTANCE_FILTER="${1:-}"

ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
START=$(python3 -c "from datetime import datetime, timedelta; print((datetime.utcnow() - timedelta(hours=${ANALYSIS_HOURS})).strftime('%Y-%m-%dT%H:%M:%SZ'))")
END=$(python3 -c "from datetime import datetime; print(datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'))")

echo "=== EC2 INSTANCES AND ATTACHED VOLUMES ==="
INSTANCES=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{
    ID:InstanceId,
    Type:InstanceType,
    AZ:Placement.AvailabilityZone,
    Name:Tags[?Key==`Name`]|[0].Value,
    Volumes:BlockDeviceMappings[*].{Dev:DeviceName,VolumeId:Ebs.VolumeId}
  }' \
  --output json)
echo "$INSTANCES"

echo ""
echo "=== EBS VOLUME DETAILS AND PROVISIONED PERFORMANCE ==="
VOLUME_IDS=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].BlockDeviceMappings[*].Ebs.VolumeId' \
  --output text | tr '\t' '\n' | sort -u)

for VOL_ID in $VOLUME_IDS; do
  aws ec2 describe-volumes \
    --volume-ids "$VOL_ID" \
    --region "$REGION" \
    --query 'Volumes[0].{
      ID:VolumeId,
      Type:VolumeType,
      SizeGB:Size,
      ProvisionedIOPS:Iops,
      Throughput:Throughput,
      State:State,
      Encrypted:Encrypted,
      MultiAttach:MultiAttachEnabled
    }' \
    --output json 2>/dev/null
done

echo ""
echo "=== INSTANCE-LEVEL EBS BANDWIDTH CAPS ==="
# Pull instance type info from the EC2 API for running instances
INSTANCE_TYPES=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].InstanceType' \
  --output text | tr '\t' '\n' | sort -u)

for ITYPE in $INSTANCE_TYPES; do
  echo "--- $ITYPE ---"
  aws ec2 describe-instance-types \
    --instance-types "$ITYPE" \
    --region "$REGION" \
    --query 'InstanceTypes[0].{
      Type:InstanceType,
      vCPUs:VCpuInfo.DefaultVCpus,
      MemoryMiB:MemoryInfo.SizeInMiB,
      EBSOptimized:EbsInfo.EbsOptimizedSupport,
      BaselineBandwidthMbps:EbsInfo.EbsOptimizedInfo.BaselineBandwidthInMbps,
      MaxBandwidthMbps:EbsInfo.EbsOptimizedInfo.MaximumBandwidthInMbps,
      BaselineIOPS:EbsInfo.EbsOptimizedInfo.BaselineIops,
      MaxIOPS:EbsInfo.EbsOptimizedInfo.MaximumIops,
      BaselineThroughputMBps:EbsInfo.EbsOptimizedInfo.BaselineThroughputInMBps,
      MaxThroughputMBps:EbsInfo.EbsOptimizedInfo.MaximumThroughputInMBps,
      NetworkBandwidthGbps:NetworkInfo.NetworkCards[0].BaselineBandwidthInGbps
    }' \
    --output json 2>/dev/null
done

echo ""
echo "=== CLOUDWATCH: EBS VOLUME METRICS (last 30 min) ==="
for VOL_ID in $VOLUME_IDS; do
  echo "--- Volume: $VOL_ID ---"
  for METRIC in VolumeReadOps VolumeWriteOps VolumeReadBytes VolumeWriteBytes VolumeQueueLength VolumeThroughputPercentage BurstBalance; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/EBS \
      --metric-name "$METRIC" \
      --dimensions Name=VolumeId,Value="$VOL_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Maximum Sum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum,Sum]' \
      --output text 2>/dev/null || echo "N/A")
    echo "  $METRIC: $VALUE"
  done
done

echo ""
echo "=== CLOUDWATCH: INSTANCE-LEVEL EBS THROTTLE CHECKS ==="
INSTANCE_IDS=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].InstanceId' \
  --output text | tr '\t' '\n')

for INST_ID in $INSTANCE_IDS; do
  echo "--- Instance: $INST_ID ---"
  for METRIC in EBSReadOps EBSWriteOps EBSReadBytes EBSWriteBytes EBSIOBalance EBSByteBalance; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name "$METRIC" \
      --dimensions Name=InstanceId,Value="$INST_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Minimum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Minimum]' \
      --output text 2>/dev/null || echo "N/A")
    echo "  $METRIC: $VALUE"
  done
  for THROTTLE_METRIC in InstanceEBSIOPSExceededCheck InstanceEBSThroughputExceededCheck VolumeIOPSExceededCheck VolumeThroughputExceededCheck; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/EBS \
      --metric-name "$THROTTLE_METRIC" \
      --dimensions Name=InstanceId,Value="$INST_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Maximum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[?Maximum==`1`].[Timestamp,Maximum]' \
      --output text 2>/dev/null || echo "")
    if [ -n "$VALUE" ]; then
      echo "  *** THROTTLE DETECTED $THROTTLE_METRIC: $VALUE"
    fi
  done
done

echo ""
echo "=== RDS STORAGE: IOPS AND FREEABLE SPACE ==="
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[*].{
    ID:DBInstanceIdentifier,
    Class:DBInstanceClass,
    Engine:Engine,
    StorageType:StorageType,
    AllocatedGB:AllocatedStorage,
    ProvisionedIOPS:Iops,
    StorageAutoscaling:MaxAllocatedStorage,
    MultiAZ:MultiAZ
  }' \
  --output json

RDS_INSTANCES=$(aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[].DBInstanceIdentifier' \
  --output text)

for RDS_ID in $RDS_INSTANCES; do
  echo ""
  echo "--- RDS $RDS_ID IOPS/storage metrics ---"
  for METRIC in FreeStorageSpace ReadIOPS WriteIOPS ReadLatency WriteLatency DiskQueueDepth ReadThroughput WriteThroughput; do
    VALUE=$(aws cloudwatch get-metric-statistics \
      --namespace AWS/RDS \
      --metric-name "$METRIC" \
      --dimensions Name=DBInstanceIdentifier,Value="$RDS_ID" \
      --start-time "$START" --end-time "$END" \
      --period 300 --statistics Average Maximum \
      --region "$REGION" \
      --query 'sort_by(Datapoints,&Timestamp)[-1].[Average,Maximum]' \
      --output text 2>/dev/null || echo "N/A")
    echo "  $METRIC: avg=$(echo $VALUE | awk '{print $1}') max=$(echo $VALUE | awk '{print $2}')"
  done
done
EOF
chmod +x ./diag-storage-iops.sh
cat > ./prompt-storage-iops.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-storage-iops.sh | ./bedrock-ask.sh \
  "Analyse this EC2 and EBS storage performance data for a production incident where applications may be experiencing slow I/O, database timeouts, or latency spikes.

There are two independent throttle points that must both be checked. The first is the volume level: for gp2 volumes, identify the BurstBalance metric and flag any volume where it is below 20%, because once it reaches 0% the volume drops from its burst ceiling of 3,000 IOPS to its sustained baseline of 3 IOPS per GiB, so a 250 GB gp2 volume that has exhausted burst credits is limited to 750 IOPS regardless of demand. For gp3 volumes, compare the provisioned IOPS against what the CloudWatch ReadOps and WriteOps metrics show is actually being consumed, and flag any volume where the sum approaches the provisioned ceiling. For any volume showing VolumeQueueLength above 1 sustained for more than a few minutes, the volume cannot service I/O as fast as it arrives and latency will grow unboundedly until load drops.

The second throttle point is the instance level: each instance type has a BaselineIOPS and MaxIOPS for its EBS-optimized connection, and an EBSIOBalance credit pool for burstable instance types. An InstanceEBSIOPSExceededCheck or InstanceEBSThroughputExceededCheck value of 1 means the instance's aggregate I/O across all volumes has exceeded the instance-level ceiling, not just a single volume, which means upgrading the volume IOPS will not help; the instance type must be changed.

Critically, look for the mismatch pattern: volumes provisioned with high IOPS attached to an instance type whose MaxIOPS or MaxThroughputMBps is lower than the sum of provisioned IOPS across all attached volumes. This is the most common IOPS misconfiguration and is invisible until sustained load hits the instance ceiling. For example, attaching two io2 volumes each provisioned at 10,000 IOPS to an m5.large with a MaxIOPS of 16,000 means the instance is the bottleneck at any load above 16,000 aggregate IOPS even though the volumes can theoretically supply 20,000.

For RDS instances: flag DiskQueueDepth above 1 as an active storage bottleneck. Flag any instance where ReadLatency or WriteLatency has increased significantly, as this often precedes visible application latency by several minutes. Flag RDS instances without storage autoscaling enabled if FreeStorageSpace is below 20% of allocated storage."
EOF
chmod +x ./prompt-storage-iops.sh

12. Observability Readiness: Logging Gap Audit

The single most reliable predictor of a slow incident resolution is missing logs. When VPC flow logs are not enabled, you cannot confirm what traffic is actually flowing and where it is being rejected. When Route 53 Resolver query logs are not enabled, you cannot see NXDOMAIN responses and cannot trace DNS failure to a specific query name or source instance. When CloudTrail is not configured with data events, you cannot see which IAM principal made the change that introduced the failure. When RDS slow query logs are not exported to CloudWatch, you cannot correlate a database performance problem with specific SQL statements without connecting directly to the instance.

This section provides a dedicated script that audits all of these logging gaps before an incident happens and reports clearly on what is missing and how to enable it. It is also useful to run at the start of an incident to understand which diagnostic tools are available and which are not.

cat > ./diag-observability-gaps.sh << 'EOF'
#!/bin/bash
# diag-observability-gaps.sh: Audit observability coverage across VPC flow logs,
# Route 53 Resolver query logs, CloudTrail configuration, RDS log exports,
# CloudWatch alarms, and ELB access logs. Reports gaps and provides enable commands.
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

echo "=== OBSERVABILITY READINESS AUDIT ==="
echo "Account: $ACCOUNT_ID  Region: $REGION  $(date -u)"
echo ""

echo "=== 1. VPC FLOW LOGS ==="
VPC_IDS=$(aws ec2 describe-vpcs --region "$REGION" \
  --query 'Vpcs[].VpcId' --output text)
FLOW_LOG_VPCS=$(aws ec2 describe-flow-logs --region "$REGION" \
  --query 'FlowLogs[?ResourceType==`VPC`].ResourceId' --output text)

for VPC in $VPC_IDS; do
  if echo "$FLOW_LOG_VPCS" | grep -q "$VPC"; then
    DEST=$(aws ec2 describe-flow-logs --region "$REGION" \
      --filter "Name=resource-id,Values=$VPC" \
      --query 'FlowLogs[0].{Dest:LogDestinationType,Group:LogGroupName,Format:LogFormat}' \
      --output json 2>/dev/null)
    echo "  [OK]  $VPC: flow logs enabled - $DEST"
    # Check if using custom format with tcp-flags
    FORMAT=$(aws ec2 describe-flow-logs --region "$REGION" \
      --filter "Name=resource-id,Values=$VPC" \
      --query 'FlowLogs[0].LogFormat' --output text 2>/dev/null || echo "")
    if echo "$FORMAT" | grep -q "tcp-flags"; then
      echo "        Custom format with tcp-flags: YES (enables zero-window analysis)"
    else
      echo "        Custom format with tcp-flags: NO  (upgrade recommended for TCP signal analysis)"
      echo "        To enable: recreate flow log with format including \${tcp-flags}"
    fi
  else
    echo "  [MISSING] $VPC: NO FLOW LOGS"
    echo "    Enable: aws ec2 create-flow-logs \\"
    echo "      --resource-type VPC --resource-ids $VPC \\"
    echo "      --traffic-type ALL \\"
    echo "      --log-destination-type cloud-watch-logs \\"
    echo "      --log-group-name /aws/vpc/flowlogs \\"
    echo "      --deliver-logs-permission-arn arn:aws:iam::${ACCOUNT_ID}:role/FlowLogsRole \\"
    echo "      --log-format '\$(version) \$(account-id) \$(interface-id) \$(srcaddr) \$(dstaddr) \$(srcport) \$(dstport) \$(protocol) \$(packets) \$(bytes) \$(start) \$(end) \$(action) \$(log-status) \$(tcp-flags)' \\"
    echo "      --region $REGION"
  fi
done

echo ""
echo "=== 2. ROUTE 53 RESOLVER QUERY LOGS ==="
RESOLVER_LOG_CONFIGS=$(aws route53resolver list-resolver-query-log-configs \
  --region "$REGION" \
  --query 'ResolverQueryLogConfigs[].{ID:Id,Name:Name,Dest:DestinationArn,Status:Status}' \
  --output json 2>/dev/null || echo "[]")
ASSOC_VPCS=$(aws route53resolver list-resolver-query-log-config-associations \
  --region "$REGION" \
  --query 'ResolverQueryLogConfigAssociations[].ResourceId' \
  --output text 2>/dev/null || echo "")

echo "  Query log configs: $RESOLVER_LOG_CONFIGS"
echo ""
for VPC in $VPC_IDS; do
  if echo "$ASSOC_VPCS" | grep -q "$VPC"; then
    echo "  [OK]  $VPC: DNS query logging associated"
  else
    echo "  [MISSING] $VPC: NO DNS QUERY LOGGING"
    echo "    Without DNS query logs, NXDOMAIN failures and DNS-related outages cannot be traced to specific query names or source instances."
    echo "    Enable:"
    echo "    1. Create log group: aws logs create-log-group --log-group-name /aws/route53resolver/query-logs --region $REGION"
    echo "    2. Create config:  aws route53resolver create-resolver-query-log-config \\"
    echo "         --name prod-dns-query-logs \\"
    echo "         --destination-arn arn:aws:logs:${REGION}:${ACCOUNT_ID}:log-group:/aws/route53resolver/query-logs \\"
    echo "         --region $REGION"
    echo "    3. Associate VPC: aws route53resolver associate-resolver-query-log-config \\"
    echo "         --resolver-query-log-config-id <config-id-from-step-2> \\"
    echo "         --resource-id $VPC \\"
    echo "         --region $REGION"
  fi
done

echo ""
echo "=== 3. CLOUDTRAIL ==="
TRAILS=$(aws cloudtrail describe-trails --region "$REGION" --output json 2>/dev/null) || TRAILS='{"trailList":[]}'
if [ -z "$TRAILS" ] || [ "$TRAILS" = "null" ]; then
  TRAILS='{"trailList":[]}'
fi
echo "$TRAILS" | python3 -c "
import json, sys
trails = json.load(sys.stdin).get('trailList', [])
if not trails:
    print('  [CRITICAL] No CloudTrail trails found. API changes are not logged.')
    print('    Without CloudTrail, you cannot determine which IAM principal made')
    print('    configuration changes that may have introduced the incident.')
else:
    for t in trails:
        name = t.get('Name','')
        multi = t.get('IsMultiRegionTrail', False)
        validation = t.get('LogFileValidationEnabled', False)
        mgmt = t.get('IncludeManagementEvents', True)
        data_events = bool(t.get('EventSelectors'))
        print(f'  Trail: {name}')
        print(f'    Multi-region: {multi}  Log validation: {validation}')
        if not multi:
            print('    [WARN] Single-region trail misses global service events (IAM, STS)')
        if not validation:
            print('    [WARN] Log file validation disabled - tampered logs cannot be detected')
        print(f'    Data events configured: {data_events}')
        if not data_events:
            print('    [INFO] No data events - S3 object-level and Lambda invoke events not logged')
            print('           To add S3 data events: aws cloudtrail put-event-selectors \\')
            print(f'             --trail-name {name} \\')
            print(\"             --event-selectors '[{\\\"ReadWriteType\\\": \\\"All\\\", \\\"IncludeManagementEvents\\\": true, \\\"DataResources\\\": [{\\\"Type\\\": \\\"AWS::S3::Object\\\", \\\"Values\\\": [\\\"arn:aws:s3:::*/*\\\"]}]}]'\")
" 2>/dev/null

echo ""
echo "=== 4. CLOUDTRAIL: IS LOGGING ACTIVE? ==="
TRAIL_NAMES=$(aws cloudtrail describe-trails --region "$REGION" \
  --query 'trailList[].Name' --output text 2>/dev/null) || TRAIL_NAMES=""
if [ -z "$TRAIL_NAMES" ]; then
  echo "  No trails found or describe-trails failed"
fi
for TRAIL in $TRAIL_NAMES; do
  STATUS=$(aws cloudtrail get-trail-status --name "$TRAIL" --region "$REGION" \
    --query '{IsLogging:IsLogging,LatestDelivery:LatestDeliveryTime,LatestError:LatestDeliveryError}' \
    --output json 2>/dev/null || echo '{"IsLogging":false}')
  echo "  $TRAIL: $STATUS"
done

echo ""
echo "=== 5. RDS LOG EXPORTS TO CLOUDWATCH ==="
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,Engine:Engine,LogExports:EnabledCloudwatchLogsExports}' \
  --output json | python3 -c "
import json, sys
instances = json.load(sys.stdin)
for inst in instances:
    id_ = inst['ID']
    engine = inst['Engine']
    exports = inst.get('LogExports') or []
    print(f'  {id_} ({engine}): exports={exports}')
    if 'postgresql' in engine or 'aurora-postgresql' in engine:
        missing = [e for e in ['postgresql', 'upgrade'] if e not in exports]
        if missing:
            print(f'    [MISSING] {missing} - slow query and error logs not visible in CloudWatch')
    elif 'mysql' in engine or 'aurora' in engine:
        missing = [e for e in ['slowquery', 'error', 'general'] if e not in exports]
        if missing:
            print(f'    [MISSING] {missing} - slow query analysis requires slowquery export')
    if not exports:
        print('    Enable via: aws rds modify-db-instance --db-instance-identifier ' + id_ + ' \\')
        print('      --cloudwatch-logs-export-configuration \\')
        print('      EnableLogTypes=[slowquery,error,general,audit] --apply-immediately')
"

echo ""
echo "=== 6. ELB ACCESS LOGS ==="
LOAD_BALANCER_ARNS=$(aws elbv2 describe-load-balancers \
  --region "$REGION" \
  --query 'LoadBalancers[].LoadBalancerArn' \
  --output text)
for LB_ARN in $LOAD_BALANCER_ARNS; do
  LB_NAME=$(aws elbv2 describe-load-balancers --load-balancer-arns "$LB_ARN" \
    --query 'LoadBalancers[0].LoadBalancerName' --output text --region "$REGION")
  ACCESS_LOGS=$(aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$LB_ARN" \
    --region "$REGION" \
    --query 'Attributes[?Key==`access_logs.s3.enabled`].Value' \
    --output text 2>/dev/null || echo "false")
  if [ "$ACCESS_LOGS" = "true" ]; then
    echo "  [OK]  $LB_NAME: access logs enabled"
  else
    echo "  [MISSING] $LB_NAME: access logs DISABLED"
    echo "    Without ELB access logs, 4xx/5xx breakdowns by client IP and target are unavailable."
    echo "    Enable: aws elbv2 modify-load-balancer-attributes \\"
    echo "      --load-balancer-arn $LB_ARN \\"
    echo "      --attributes Key=access_logs.s3.enabled,Value=true \\"
    echo "                   Key=access_logs.s3.bucket,Value=<your-logs-bucket> \\"
    echo "      --region $REGION"
  fi
done

echo ""
echo "=== 7. CLOUDWATCH ALARM COVERAGE ==="
SERVICES_WITH_ALARMS=$(aws cloudwatch describe-alarms \
  --region "$REGION" \
  --query 'MetricAlarms[].Namespace' \
  --output text | tr '\t' '\n' | sort -u)
echo "Namespaces with configured alarms:"
echo "$SERVICES_WITH_ALARMS" | awk '{print "  "$0}'

echo ""
echo "Key namespaces to check for alarm coverage:"
for NS in AWS/EC2 AWS/RDS AWS/EBS AWS/ApplicationELB AWS/NetworkELB AWS/Lambda AWS/EKS; do
  COUNT=$(aws cloudwatch describe-alarms \
    --region "$REGION" \
    --alarm-name-prefix "" \
    --query "MetricAlarms[?Namespace=='${NS}'] | length(@)" \
    --output text 2>/dev/null || echo "0")
  if [ "${COUNT:-0}" -eq 0 ]; then
    echo "  [MISSING] $NS: no alarms configured"
  else
    echo "  [OK]  $NS: $COUNT alarm(s)"
  fi
done

echo ""
echo "=== 8. ENHANCED MONITORING: RDS ==="
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[*].{ID:DBInstanceIdentifier,MonitoringInterval:MonitoringInterval,PerformanceInsights:PerformanceInsightsEnabled}' \
  --output json | python3 -c "
import json, sys
for inst in json.load(sys.stdin):
    id_ = inst['ID']
    mi = inst.get('MonitoringInterval', 0)
    pi = inst.get('PerformanceInsights', False)
    issues = []
    if mi == 0: issues.append('Enhanced Monitoring DISABLED')
    elif mi > 15: issues.append(f'Enhanced Monitoring interval {mi}s (recommend 15 or less)')
    if not pi: issues.append('Performance Insights DISABLED')
    if issues:
        print(f'  [MISSING] {id_}: {\" | \".join(issues)}')
    else:
        print(f'  [OK]  {id_}: Enhanced Monitoring every {mi}s, Performance Insights enabled')
"
EOF
chmod +x ./diag-observability-gaps.sh
cat > ./prompt-observability-gaps.sh << 'EOF'
#!/bin/bash
set -euo pipefail
./diag-observability-gaps.sh | ./bedrock-ask.sh \
  "Analyse this observability readiness audit for a production AWS account. Your goal is to identify which logging and monitoring capabilities are missing, explain what incident scenarios each gap makes impossible to diagnose, and rank the gaps by severity.

The highest severity gaps are those that block root cause analysis for the most common incident types: VPC flow logs missing means you cannot diagnose network-level connectivity failures, security group regressions, or zero-window TCP stalls; Route 53 Resolver query logs missing means you cannot trace NXDOMAIN or SERVFAIL responses to specific query names or source instances during a DNS-related outage; CloudTrail not logging or not multi-region means you cannot determine which IAM principal made the configuration change that introduced an incident; RDS Performance Insights disabled means you cannot identify slow queries or lock contention without direct database access during an incident.

Medium severity gaps: RDS slow query logs not exported to CloudWatch means you cannot correlate database slowness with specific SQL patterns using the scripts in this guide; ELB access logs disabled means 4xx and 5xx breakdowns by client IP are unavailable; CloudWatch enhanced monitoring disabled on RDS means the 60-second and 15-second granularity metrics for OS-level resources are missing.

Lower severity but still meaningful: missing CloudWatch alarms on namespaces that have running resources means operational failures go undetected until they escalate; VPC flow logs enabled but without the tcp-flags custom field means zero-window and RST analysis is unavailable even though connection-level data is present.

For each gap, state: what specific diagnostic capability is lost, which section of this triage guide becomes unavailable, and the minimal enable command needed to close the gap."
EOF
chmod +x ./prompt-observability-gaps.sh

Before you have isolated the root cause, it helps to run a broad sweep across your application log groups to find the first error that appeared and trace from there.

cat > ./diag-logs-sweep.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
LOG_GROUP_PREFIX="${1:-/aws/containerinsights}"
ANALYSIS_HOURS="${ANALYSIS_HOURS:-24}"
MINUTES_BACK="${2:-$(( ANALYSIS_HOURS * 60 ))}"

echo "=== LOG GROUP DISCOVERY ==="
aws logs describe-log-groups \
  --log-group-name-prefix "$LOG_GROUP_PREFIX" \
  --region "$REGION" \
  --query 'logGroups[].{Name:logGroupName,StoredMB:storedBytes,RetentionDays:retentionInDays}' \
  --output json | head -200

echo ""
echo "=== ERROR SWEEP ACROSS LOG GROUPS (last ${MINUTES_BACK} min) ==="

START_TIME=$(python3 -c "import time; print(int((time.time() - ${MINUTES_BACK}*60) * 1000))")
END_TIME=$(python3 -c "import time; print(int(time.time() * 1000))")

LOG_GROUPS=$(aws logs describe-log-groups \
  --log-group-name-prefix "$LOG_GROUP_PREFIX" \
  --region "$REGION" \
  --query 'logGroups[].logGroupName' \
  --output text 2>/dev/null | tr '\t' '\n' | head -20) || LOG_GROUPS=""

if [ -z "$LOG_GROUPS" ]; then
  echo "No log groups found with prefix '$LOG_GROUP_PREFIX' or describe-log-groups failed."
  echo "Check the prefix and that the prod-diagnostics role has logs:DescribeLogGroups permission."
fi

for LG in $LOG_GROUPS; do
  echo ""
  echo "--- Log group: $LG ---"

  QUERY_ID=$(aws logs start-query \
    --log-group-name "$LG" \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --query-string 'fields @timestamp, @message
      | filter @message like /(?i)(error|exception|fatal|panic|OOM|timeout|refused|unavailable|500|503|CrashLoopBackOff|ImagePullBackOff|OOMKilled)/
      | sort @timestamp desc
      | limit 20' \
    --region "$REGION" \
    --query 'queryId' \
    --output text 2>/dev/null) || { echo "  [WARN] start-query failed for $LG" >&2; continue; }

  if [ -z "$QUERY_ID" ]; then
    echo "  [WARN] empty query ID for $LG - start-query may have failed silently" >&2
    continue
  fi

  # Poll until complete rather than blind sleep
  for _poll in {1..6}; do
    _status=$(aws logs get-query-results --query-id "$QUERY_ID" --region "$REGION" \
      --query 'status' --output text 2>/dev/null) || _status="Failed"
    [ "$_status" = "Complete" ] && break
    [ "$_status" = "Failed" ] && { echo "  [WARN] query failed for $LG" >&2; QUERY_ID=""; break; }
    sleep 5
  done

  [ -z "$QUERY_ID" ] && continue

  aws logs get-query-results \
    --query-id "$QUERY_ID" \
    --region "$REGION" \
    --output json 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
results = data.get('results', [])
for r in results[:10]:
  row = {item['field']: item['value'] for item in r}
  ts = row.get('@timestamp', '')
  msg = row.get('@message', '')[:300]
  print(f'  [{ts}] {msg}')
" 2>/dev/null || echo "  (no results or query error)"
done
EOF
chmod +x ./diag-logs-sweep.sh
cat > ./prompt-logs-sweep.sh << 'EOF'
#!/bin/bash
set -euo pipefail
LOG_PREFIX="${1:-/aws/containerinsights}"
./diag-logs-sweep.sh "$LOG_PREFIX" 30 | ./bedrock-ask.sh \
  "Analyse these application log entries collected during a production incident. Your goal is to identify the earliest error that appeared and trace the cascade from there. Look for: the timestamp of the first error versus when the incident was reported, whether errors in different log groups share a common timestamp suggesting a coordinated failure point, error messages that point to specific downstream dependencies like a database connection string, a queue name, or an external API endpoint, stack traces that reveal the exact code path that is failing, and patterns where one service starts failing followed by other services failing in a cascade suggesting a dependency failure. The first error is almost always the most important one."
EOF
chmod +x ./prompt-logs-sweep.sh

13. The Full Incident Runbook

When you arrive at an incident and do not yet have a hypothesis, run these scripts in this order. Each one takes about two to four minutes to execute and pipe through Bedrock. Within twenty minutes you should have a prioritised list of candidates.

cat > ./incident-runbook.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics

REGION="${1:-ap-southeast-1}"
export AWS_DEFAULT_REGION="$REGION"
DB_INSTANCE="${2:-}"
LOG_GROUP="${3:-/aws/vpc/flowlogs}"
K8S_CONTEXT="${4:-}"
NAMESPACE="${5:-}"
ANALYSIS_HOURS="${6:-24}"

export K8S_CONTEXT
export ANALYSIS_HOURS

echo "========================================"
echo " PRODUCTION INCIDENT DIAGNOSTIC RUNBOOK "
echo " $(date -u)"
echo " Region: $REGION"
echo " Analysis window: ${ANALYSIS_HOURS}h"
echo "========================================"
echo ""

TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTPUT_DIR="./incidents/incident-${TIMESTAMP}"
EVIDENCE_DIR="$OUTPUT_DIR/evidence"
mkdir -p "$OUTPUT_DIR" "$EVIDENCE_DIR"
export EVIDENCE_DIR

run_and_save() {
  local name="$1"
  local script="$2"
  local prompt="$3"
  local evidence_file="${OUTPUT_DIR}/${name}.txt"
  local analysis_file="${OUTPUT_DIR}/${name}-analysis.txt"

  echo ""
  echo "========================================"
  echo " PHASE: $name"
  echo "========================================"

  # Collect evidence to disk first. Stdout and stderr both captured.
  # Redirection order: > file first, then 2>&1 redirects stderr to the
  # already-redirected stdout. Reversed order (2>&1 > file) leaves stderr
  # on the terminal instead of capturing it.
  # eval is used here because $script may include arguments (e.g. "./diag-foo.sh arg1 arg2").
  # All values are constructed internally by this script, not from user input,
  # so eval is safe in this context.
  if ! eval "$script" > "$evidence_file" 2>&1; then
    echo "[WARN] Phase $name exited non-zero. Evidence may be partial." >&2
  fi
  echo "[Evidence written to $evidence_file ($(wc -c < "$evidence_file") bytes)]"

  cat "$evidence_file" | ./bedrock-ask.sh "$prompt" | tee "$analysis_file"
  echo "[Analysis saved to $analysis_file]"
}

echo "Phase 1: Load balancer health..."
run_and_save "nlb" \
  "./diag-nlb.sh" \
  "Identify unhealthy targets, degraded load balancers, and availability zone failures."

echo ""
echo "Phase 2: Network security groups and routing..."
run_and_save "network-sg" \
  "./diag-network-sg.sh" \
  "Identify security group misconfigurations, open SSH/RDP, NACL blocking, and VPCs missing flow logs."

echo ""
echo "Phase 3: VPC flow logs and TCP signal analysis..."
run_and_save "flow-logs" \
  "./diag-flow-logs.sh $LOG_GROUP "" $ANALYSIS_HOURS" \
  "Identify rejected traffic patterns and anomalies in accepted flow volumes."

if [ -n "$K8S_CONTEXT" ]; then
  echo ""
  echo "Phase 4: Kubernetes pod state..."
  run_and_save "k8s-pods" \
    "./diag-k8s-pods.sh $NAMESPACE" \
    "Identify crashlooping pods, ImagePullBackOff containers, OOMKilled events, and node pressure."

  echo ""
  echo "Phase 5: CoreDNS health..."
  run_and_save "coredns" \
    "./diag-coredns.sh" \
    "Identify CoreDNS failures, misconfigured forwarders, and DNS resolution issues."

  echo ""
  echo "Phase 5b: DNS resolution paths and heterogeneous resolution..."
  run_and_save "dns-paths" \
    "./diag-dns-paths.sh" \
    "Identify ndots amplification, split DNS misrouting of internal names to public resolvers, and private hosted zone association gaps."
fi

echo ""
echo "Phase 6: RDS state and events..."
run_and_save "rds-state" \
  "./diag-rds-state.sh" \
  "Identify database availability issues, failover events, saturation metrics, and missing Performance Insights."

if [ -n "$DB_INSTANCE" ]; then
  echo ""
  echo "Phase 7: Performance Insights..."
  run_and_save "rds-pi" \
    "./diag-rds-pi.sh $DB_INSTANCE" \
    "Identify top wait events, slow SQL, and DB load contributing to the incident."

  echo ""
  echo "Phase 8: Slow query logs..."
  run_and_save "rds-slow" \
    "./diag-rds-slow-queries.sh $DB_INSTANCE $(( ANALYSIS_HOURS * 60 ))" \
    "Identify full table scans, lock contention, and high-frequency slow queries."

  echo ""
  echo "Phase 8b: Aurora query plan management and memory growth..."
  run_and_save "aurora-qpm" \
    "./diag-aurora-qpm.sh $DB_INSTANCE" \
    "Identify query plan regression, unapproved or rejected QPM plans accumulating calls, work_mem spill to disk, and stale statistics causing planner misestimates."
fi

echo ""
echo "Phase 9: OpenSearch Service health..."
run_and_save "opensearch" \
  "./diag-opensearch.sh" \
  "Identify cluster status red or yellow, JVM heap pressure above 80% heading toward the 92% write-block threshold, ClusterIndexWritesBlocked events, shard allocation failures, storage pressure, search and indexing latency spikes, and domains without zone awareness, dedicated master nodes, or slow log configuration."

echo ""
echo "Phase 10: Cache diagnostics (ElastiCache and DAX)..."
run_and_save "cache" \
  "./diag-cache.sh" \
  "Identify cache hit rate drops that indicate eviction pressure or expired TTLs driving miss storms to the database, MemoryFragmentationRatio above 1.5 indicating defragmentation is needed, EngineCPUUtilization saturation on the single Redis thread, replication lag above 10 seconds on replica nodes, resharding in progress (modifying status) causing write latency spikes on migrating slots, and DAX hit rates below 90% where the cache is adding latency without reducing DynamoDB load."

echo ""
echo "Phase 11: S3 access and error rates..."
run_and_save "s3" \
  "./diag-s3.sh" \
  "Identify S3 4xx errors, throttling, missing public access blocks, and lifecycle or policy changes."

echo ""
echo "Phase 12: Storage and IOPS bottleneck analysis..."
run_and_save "storage-iops" \
  "./diag-storage-iops.sh" \
  "Identify EBS burst credit exhaustion, instance-level EBS bandwidth ceiling mismatches, gp2 volumes that should be gp3, VolumeQueueLength above 1, and InstanceEBSIOPSExceededCheck or InstanceEBSThroughputExceededCheck throttle events."

echo ""
echo "Phase 13: Security and compliance sweep..."
run_and_save "security" \
  "./diag-security.sh" \
  "Identify CloudTrail gaps, expiring certificates, unprotected ALBs, SSM coverage gaps, and firing alarms."

echo ""
echo "Phase 14: Observability gap audit..."
run_and_save "observability-gaps" \
  "./diag-observability-gaps.sh" \
  "Identify missing VPC flow logs, missing Route 53 Resolver query logs, CloudTrail misconfigurations, RDS log export gaps, missing ELB access logs, and CloudWatch alarm coverage gaps."

echo ""
echo "Phase 15: Application log sweep..."
run_and_save "app-logs" \
  "./diag-logs-sweep.sh /aws/containerinsights $(( ANALYSIS_HOURS * 60 ))" \
  "Identify the first error timestamp and trace the failure cascade."

echo ""
echo "========================================"
echo " INCIDENT DIAGNOSTIC COMPLETE"
echo " All output saved to: $OUTPUT_DIR"
echo "========================================"

echo ""
echo "=== FINAL SYNTHESIS ==="
cat "${OUTPUT_DIR}"/*-analysis.txt | ./bedrock-ask.sh \
  "You have been given analysis outputs from up to seventeen diagnostic phases (the exact number depends on which EKS and database flags were set) covering load balancers, networking, TCP signal analysis, DNS resolution paths, Kubernetes pods, CoreDNS, RDS database state, Performance Insights, slow queries, Aurora query plan management, OpenSearch Service cluster health, ElastiCache and DAX cache diagnostics, S3, security, storage IOPS, observability gaps, and application logs. All evidence was collected over a ${ANALYSIS_HOURS}-hour analysis window. Your job is to synthesise these into a single incident report. Produce: (1) a one-paragraph executive summary of what is failing and why, (2) a ranked list of root cause candidates with the evidence supporting each and a confidence rating, (3) the three most important immediate remediation actions in priority order, and (4) the observability gap that prevented earlier detection of this incident. Be direct. Do not repeat the evidence back. Synthesise it." | tee "${OUTPUT_DIR}/final-synthesis.txt"

echo ""
echo "Final synthesis saved to: ${OUTPUT_DIR}/final-synthesis.txt"
EOF
chmod +x ./incident-runbook.sh

Run the full runbook as follows:

# Minimum invocation (24 hour analysis window, network and DB only):
./incident-runbook.sh ap-southeast-1

# Full invocation including EKS, a specific RDS instance, and default 24h window:
./incident-runbook.sh ap-southeast-1 my-prod-db /aws/vpc/flowlogs prod-diagnostics-my-cluster production

# Override analysis window to 4 hours (useful when incident is recent and well-bounded):
./incident-runbook.sh ap-southeast-1 my-prod-db /aws/vpc/flowlogs prod-diagnostics-my-cluster production 4

# Or set via environment variable, which all individual scripts also honour:
export ANALYSIS_HOURS=48
./incident-runbook.sh ap-southeast-1

14. A Complete Incident Walkthrough

Concepts are useful. A worked example showing exactly what the output looks like and how it changes a real decision is more useful. The following is an annotated walkthrough of a production checkout failure, showing actual evidence inputs, actual Bedrock output structure, and the human decision that resolved it. The values are representative rather than verbatim, but the flow is real.

Symptom reported at 02:14 UTC: Checkout endpoint returning 503 for approximately 35% of requests. Payment success rate declining. No deployment in the previous 4 hours.

Step 1: Fast blast radius estimate (2 minutes)

The engineer runs ./diag-nlb.sh and ./diag-security.sh and pipes both to bedrock-ask.sh with the question "Estimate blast radius from this early evidence. What is failing, who is affected, and what should we investigate first?" The response takes 40 seconds. This is the actual JSON that bedrock-ask.sh prints to the terminal:

{
  "incident_phase": "detect",
  "blast_radius": {
    "user_facing_impact": "Checkout and payment processing returning 503 for ~35% of requests based on ALB UnhealthyHostCount=3 of 9 targets",
    "services_impacted": ["checkout-api", "payment-worker"],
    "data_at_risk": "No data loss. Order writes failing at application layer before reaching database.",
    "estimated_recovery_time": "Unknown until root cause identified",
    "confidence": 0.71
  },
  "causal_graph": {
    "root_cause_candidate": "UNKNOWN - insufficient evidence at this stage",
    "propagation_chain": ["3 of 9 ALB targets unhealthy → 35% of requests hitting failed backends → 503"],
    "weakest_link_confidence": 0.41,
    "weakest_link_description": "Why the 3 targets became unhealthy is not yet in evidence"
  }
}

The blast radius is concrete enough to tell the engineer this is a partial backend failure, not a full outage, and the impact is isolated to checkout rather than account-wide. This changes the escalation decision: no need to page the CTO yet.

Step 2: Deep collection on checkout-api and EKS (8 minutes)

The engineer runs ./diag-k8s-pods.sh checkout and ./diag-rds-state.sh, collects evidence locally, and pipes both to Bedrock with "The blast radius shows 3 of 9 ALB targets unhealthy in the checkout namespace. Identify why."

Bedrock returns three hypotheses. Again, this is the JSON output printed to the terminal. Read it; do not run it:

{
  "incident_phase": "narrow",
  "hypotheses": [
    {
      "hypothesis": "checkout-api pods entering CrashLoopBackOff due to DNS resolution failure for payment-service.internal",
      "confidence": 0.79,
      "supporting_evidence": [
        {"observation": "payment-worker pods: 3 in CrashLoopBackOff, 12 restarts in 45 minutes",
         "evidence_quality": {"source": "app_logs", "trust": 0.72, "completeness": 0.9}},
        {"observation": "Last crash log line: dial tcp: lookup payment-service.internal: no such host",
         "evidence_quality": {"source": "app_logs", "trust": 0.72, "completeness": 0.9}}
      ],
      "contradicting_evidence": [
        "CoreDNS pod count shows 2 running pods (expected), no restarts in last hour"
      ],
      "next_best_query": "Check CoreDNS Corefile for payment-service.internal forwarding rule and compare against current VPC DHCP domain",
      "stop_condition": {
        "max_additional_queries": 3, "min_confidence_gain_per_query": 0.07,
        "current_confidence": 0.79, "queries_run": 0, "recommendation": "run_next_query"
      }
    },
    {
      "hypothesis": "RDS connection pool exhaustion blocking payment-worker startup",
      "confidence": 0.41,
      "supporting_evidence": [
        {"observation": "DatabaseConnections: 487 (current) vs baseline avg 142",
         "evidence_quality": {"source": "cloudwatch_metrics", "trust": 0.72, "completeness": 0.85}}
      ],
      "contradicting_evidence": [
        "Crash log shows DNS failure not connection refused or timeout",
        "RDS status: available, no events in last 4 hours"
      ],
      "next_best_query": "Check Performance Insights for connection wait events",
      "stop_condition": {"recommendation": "run_next_query"}
    }
  ],
  "external_sources": {
    "checked": [],
    "not_checked": ["payment gateway API status", "feature flag state", "CDN health"]
  },
  "baseline_delta": "DatabaseConnections +336 above p95 baseline. All other RDS metrics within normal range."
}

The engineer notes that the highest-confidence hypothesis (DNS failure, 0.79) has a contradicting data point: CoreDNS pods are healthy. The evidence quality annotation on the crash log is trust: 0.72 (medium, application log). The baseline delta on database connections is interesting but the crash log points elsewhere.

Step 3: CoreDNS deep dive (3 minutes)

The engineer runs ./diag-dns-paths.sh and ./diag-coredns.sh, targeting the next best query: "Inspect the CoreDNS Corefile for the payment-service.internal forwarding rule and compare against VPC DHCP options"

Bedrock returns the following. This is the confirmation output the engineer reads to decide on remediation:

{
  "incident_phase": "confirm",
  "hypotheses": [
    {
      "hypothesis": "CoreDNS Corefile contains stub zone for payment-service.internal pointing to 10.100.0.53, but VPC DHCP domain was changed from payment-service.internal to payment.internal 3 hours ago, so payment-service.internal names resolve NXDOMAIN",
      "confidence": 0.92,
      "supporting_evidence": [
        {"observation": "CoreDNS Corefile stub zone: payment-service.internal { forward . 10.100.0.53 }",
         "evidence_quality": {"source": "cloudwatch_metrics", "trust": 0.82, "completeness": 1.0}},
        {"observation": "CloudTrail: ModifyVpcAttribute at 23:11 UTC changing VPC DNS domain to payment.internal",
         "evidence_quality": {"source": "cloudtrail", "trust": 0.98, "completeness": 1.0}},
        {"observation": "Crash log: lookup payment-service.internal: no such host",
         "evidence_quality": {"source": "app_logs", "trust": 0.72, "completeness": 0.9}}
      ],
      "contradicting_evidence": [
        "CoreDNS pods are healthy with no restarts (explains why not all pods failed - only those making payment-service.internal calls)"
      ],
      "stop_condition": {
        "current_confidence": 0.92, "queries_run": 1, "recommendation": "escalate_to_human"
      }
    }
  ],
  "causal_graph": {
    "root_cause_candidate": "ModifyVpcAttribute at 23:11 UTC changed VPC DNS domain without updating CoreDNS stub zone",
    "propagation_chain": [
      "VPC domain changed to payment.internal",
      "CoreDNS stub zone still references payment-service.internal",
      "payment-worker DNS lookup for payment-service.internal returns NXDOMAIN",
      "payment-worker pods crash on startup",
      "3 of 9 ALB targets go unhealthy",
      "35% of checkout requests return 503"
    ],
    "weakest_link_confidence": 0.89,
    "weakest_link_description": "Assumed CloudTrail ModifyVpcAttribute at 23:11 is the same change; verify API caller identity"
  },
  "immediate_actions": [
    "Update CoreDNS Corefile stub zone from payment-service.internal to payment.internal",
    "Run: kubectl -n kube-system edit configmap coredns",
    "Restart CoreDNS: kubectl rollout restart deployment/coredns -n kube-system",
    "Verify: kubectl exec -it <pod> -- nslookup payment.internal"
  ]
}

Resolution: The engineer updates the CoreDNS Corefile, restarts CoreDNS, and verifies DNS resolution. ALB targets recover within 2 minutes as pods restart successfully. Total time to detected root cause: 13 minutes. Total time to resolution: 18 minutes.

Without this tooling, the previous average time to detection for this incident class had been 47 minutes, dominated by the time spent investigating the database connection count anomaly (which was a real signal but not the cause) before someone thought to check DNS.

The worked example demonstrates several properties of the system that are worth naming explicitly. Blast radius estimation before deep collection told the engineer where to focus first. The evidence quality annotations on the medium-trust application logs prevented the DNS hypothesis from being accepted at face value without corroboration from CloudTrail. The contradicting evidence field on the DNS hypothesis correctly identified that healthy CoreDNS pods needed explanation, which led to the discovery that only pods making payment-service.internal calls were failing. And the stop condition fired correctly at confidence 0.92 after one additional query, preventing the engineer from continuing to gather evidence when the hypothesis was already confirmed.

15. Example Bedrock Prompts by Incident Type

Beyond the automated scripts, here are direct prompts you can use interactively during an incident when you have a specific hypothesis to test. Pipe any relevant data through bedrock-ask.sh with these as the question argument.

“Services are returning 503 but pods look healthy”

Look for: target group health showing all targets healthy but a load balancer returning 503 upstream, a misconfigured health check path that passes but the application is actually in a degraded state, connection pool exhaustion at the application layer where the application is running but refusing new connections, or a certificate expiry on the HTTPS listener causing SSL handshake failures that manifest as 503.

“Database connections are exhausted”

Look for: a spike in application instances that each hold a fixed connection pool, a connection leak where connections are not being returned after use visible in DatabaseConnections rising monotonically, a long-running query holding connections open visible in Performance Insights wait events, or a maintenance window that reduced max_connections temporarily.

“Intermittent failures with no clear pattern”

Look for: a partial AZ failure where some targets are healthy and some are not causing round-robin to hit failed backends, CoreDNS under CPU pressure causing intermittent resolution failures for some queries but not others, a flapping security group rule that was recently modified, or a spot instance interruption in a node group causing periodic pod rescheduling. Also run diag-dns-paths.sh since ndots:5 amplification and split DNS misrouting both produce intermittent failures that appear random because only some pods are affected or because only some destination names follow the failing resolution path.

“Latency has doubled but error rate is normal”

Look for: a new slow query introduced by a code deploy visible in Performance Insights top SQL by load, an index that was dropped or rebuilt recently causing plan regression, increased garbage collection pauses in JVM applications visible in pod CPU spikes, a CDN cache invalidation that pushed load to origin, or an autoscaling group that scaled down and is now under-provisioned. On Aurora PostgreSQL specifically, run diag-aurora-qpm.sh and look at whether FreeableMemory has been declining gradually before the latency increase, which is the signature of a plan regression causing larger hash joins or sort spills rather than a traffic increase.

“Aurora PostgreSQL memory is growing and the cluster OOM-restarted”

Run diag-aurora-qpm.sh with direct DB access enabled. Look for: plans in the dba_plans view whose first_used timestamp aligns with the start of the memory growth trend; high temp_bytes in pg_stat_database indicating sort operations are spilling to disk because work_mem is too low; tables with high dead_tup_pct whose stale statistics caused the planner to choose a hash join with an unexpectedly large hash table; and whether rds.enable_plan_management is set to 1 but apg_plan_mgmt.use_plan_baselines is Off, which means QPM is capturing plans but not enforcing them. The fix for plan regression is to approve the last known good plan in dba_plans using apg_plan_mgmt.set_plan_status and then set use_plan_baselines to On to prevent further regressions.

“Only some services can reach internal endpoints; others time out”

Run diag-dns-paths.sh. The most likely cause is split DNS misconfiguration: the CoreDNS Corefile is forwarding queries for internal domains to a public upstream resolver rather than to the VPC resolver. Private hosted zone records are invisible from outside the VPC, so the public resolver returns NXDOMAIN and the application sees a connection failure. Compare the forward directive in the Corefile against the DHCP options on the VPC. Also check whether the private hosted zone is actually associated with the correct VPC in the Route 53 console, since a zone that exists but has lost its VPC association after an account restructure will fail silently.

16. Operational Hygiene: Run These Before an Incident Happens

The most useful time to run these diagnostics is not during an incident. It is the week before. Running the full suite against a healthy environment produces a baseline. When the incident happens you have a before-and-after picture that makes the anomaly obvious.

cat > ./baseline-snapshot.sh << 'EOF'
#!/bin/bash
set -euo pipefail
export AWS_PROFILE=prod-diagnostics
REGION="${1:-ap-southeast-1}"
export AWS_DEFAULT_REGION="$REGION"

TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BASELINE_DIR="./baselines/baseline-${TIMESTAMP}"
mkdir -p "$BASELINE_DIR"

echo "Collecting production baseline snapshot..."
echo "Output directory: $BASELINE_DIR"

./diag-nlb.sh > "${BASELINE_DIR}/nlb.json" 2>&1
echo "NLB state captured"

./diag-network-sg.sh > "${BASELINE_DIR}/network-sg.json" 2>&1
echo "Network security captured"

./diag-rds-state.sh > "${BASELINE_DIR}/rds-state.json" 2>&1
echo "RDS state captured"

./diag-s3.sh > "${BASELINE_DIR}/s3.json" 2>&1
echo "S3 state captured"

./diag-security.sh > "${BASELINE_DIR}/security.json" 2>&1
echo "Security posture captured"

aws ec2 describe-instances \
  --region "$REGION" \
  --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,State:State.Name,AZ:Placement.AvailabilityZone,LaunchTime:LaunchTime}' \
  --output json > "${BASELINE_DIR}/ec2.json" 2>&1
echo "EC2 state captured"

echo ""
echo "Baseline saved to: $BASELINE_DIR"
echo "Archive this directory. During the next incident, diff it against a fresh collection."
tar czf "./baselines/baseline-${TIMESTAMP}.tar.gz" -C ./baselines "baseline-${TIMESTAMP}"
echo "Compressed: ./baselines/baseline-${TIMESTAMP}.tar.gz"
EOF
chmod +x ./baseline-snapshot.sh

17. A Note on What Bedrock Is and Is Not Doing Here

The architecture and operating constraints of this workflow are covered in section 1. This closing note is about what those constraints mean in practice when you are using this system under pressure.

The important thing to understand about how this workflow operates is that Bedrock is not reading your infrastructure. It is reading structured text that your shell scripts have extracted from your infrastructure and written to disk. It has no network access to your account, no API credentials, and no awareness of your environment beyond what appears in the prompt. This is a feature, not a limitation, because it means the evidence contract in section 3 can be enforced absolutely. Bedrock cannot bypass it by querying AWS directly to check a finding.

The practical consequence of this design is that the quality of analysis is bounded by the quality of collection. If your scripts are querying the wrong time window, covering the wrong metric namespaces, or not reaching a service that is involved in the failure, Bedrock will reason well on incomplete data and arrive at plausible but incorrect conclusions. The scripts in this guide are a starting point. Adapt them to the specific log groups, metric namespaces, and service families that matter in your environment.

The second implication is that you own the inference chain. Bedrock produces hypotheses with confidence scores. You verify the highest-confidence ones against what you know and decide what action to take. The confidence scores in the structured output are a guide to prioritisation, not a measure of certainty. A hypothesis at 0.81 confidence with strong contradicting evidence listed is less reliable than one at 0.72 with no significant contradictions. Read the full structured output, not just the headline score.

The scripts here have been written to run on Amazon Linux 2, macOS with GNU coreutils, and Ubuntu 22 or later. The date command options differ between BSD and GNU variants, which is why some scripts use Python for timestamp arithmetic rather than relying on platform-specific date flags.

Prerequisites That Must Be in Place Before You Need This

Performance Insights should be enabled on every production RDS instance. The cost is minimal and the diagnostic value is disproportionate. The diag-rds-pi.sh script becomes useless without it.

VPC Flow Logs should be enabled with the custom format including tcp-flags and sending to CloudWatch Logs for every production VPC. Without flow logs you cannot confirm what traffic is flowing, diagnose security group regressions, or identify TCP zero-window stalls.

Route 53 Resolver query logs should be enabled for every production VPC. Without them, NXDOMAIN responses during a DNS incident cannot be traced to specific query names or source instances.

Slow query logging should be enabled and exported to CloudWatch Logs on all RDS MySQL and Aurora MySQL instances. Set slow_query_log = 1 and long_query_time = 1 as a starting point.

Run ./diag-observability-gaps.sh before an incident. It audits all of these prerequisites and prints the enable commands for anything missing.

The answer to the ultimate question of production outages, network failures, and everything else is almost always either a security group change you did not notice or a database query without an index. This guide helps you find out which one it is faster.

Appendix A: Bedrock Quota Check

The default Bedrock service quotas in most regions are too low for sustained use of this guide during an incident. For Claude 3.5 Sonnet, the out-of-the-box limits are around 50 requests per minute and 50,000 to 100,000 input tokens per minute. The diagnostic runbook makes between 12 and 18 Bedrock calls depending on which EKS and database flags are set, each carrying tens of thousands of tokens. Run this script once before your first incident. Quota increase requests take one to three business days.

cat > ./check-bedrock-quotas.sh << 'EOF'
#!/bin/bash
# check-bedrock-quotas.sh: Inspect Bedrock service quotas and flag anything below recommended thresholds
set -euo pipefail

export AWS_PROFILE=prod-diagnostics
REGION="${AWS_DEFAULT_REGION:-ap-southeast-1}"
MODEL_SHORT="claude-3-5-sonnet"
RECOMMENDED_RPM=300
RECOMMENDED_INPUT_TPM=400000
RECOMMENDED_OUTPUT_TPM=40000
BEDROCK_SERVICE_CODE="bedrock"

echo "========================================================"
echo " BEDROCK QUOTA CHECK  |  Region: $REGION  |  $(date -u)"
echo "========================================================"

QUOTAS=$(aws service-quotas list-service-quotas \
  --service-code "$BEDROCK_SERVICE_CODE" --region "$REGION" --output json 2>/dev/null) || {
    echo "ERROR: Could not retrieve Service Quotas."
    echo "Check manually: https://console.aws.amazon.com/servicequotas/home/services/bedrock/quotas"
    exit 1
  }

DEFAULT_QUOTAS=$(aws service-quotas list-aws-default-service-quotas \
  --service-code "$BEDROCK_SERVICE_CODE" --region "$REGION" --output json 2>/dev/null) \
  || DEFAULT_QUOTAS='{"Quotas":[]}'

python3 << PYEOF
import json, sys

applied_raw = '''$QUOTAS'''
default_raw = '''$DEFAULT_QUOTAS'''
region      = '$REGION'
model_short = '$MODEL_SHORT'
rec_rpm     = $RECOMMENDED_RPM
rec_itpm    = $RECOMMENDED_INPUT_TPM
rec_otpm    = $RECOMMENDED_OUTPUT_TPM

try:
    applied  = json.loads(applied_raw).get('Quotas', [])
    defaults = json.loads(default_raw).get('Quotas', [])
except Exception as e:
    print(f"Could not parse quota JSON: {e}"); sys.exit(1)

quota_map = {}
for q in defaults: quota_map[q['QuotaCode']] = q
for q in applied:  quota_map[q['QuotaCode']] = q

model_quotas = [q for q in quota_map.values()
                if model_short.lower() in q.get('QuotaName', '').lower()]

if not model_quotas:
    print(f"No quotas found for '{model_short}' in {region}.")
    print(f"Enable model at: https://{region}.console.aws.amazon.com/bedrock/home#/modelaccess")
    sys.exit(1)

issues = []
for q in sorted(model_quotas, key=lambda x: x.get('QuotaName', '')):
    name = q.get('QuotaName', 'Unknown'); code = q.get('QuotaCode', '')
    value = q.get('Value', 0); adj = q.get('Adjustable', False)
    threshold = label = None
    if 'requests per minute' in name.lower(): threshold, label = rec_rpm, 'RPM'
    elif 'input token' in name.lower():       threshold, label = rec_itpm, 'input TPM'
    elif 'output token' in name.lower():      threshold, label = rec_otpm, 'output TPM'
    status = ''
    if threshold and value < threshold:
        status = f'  BELOW RECOMMENDED ({threshold} {label})'
        issues.append({'name': name, 'code': code, 'current': value,
                       'recommended': threshold, 'label': label, 'adjustable': adj})
    elif threshold:
        status = '  OK'
    print(f"  {name}: {int(value)}{status}")
    print(f"    QuotaCode: {code}  Adjustable: {'Yes' if adj else 'No'}")

if not issues:
    print("\nAll quotas meet recommended thresholds.")
else:
    print(f"\n{len(issues)} quota(s) need increasing.")
    print("\nAWS Console increase URLs:")
    for i in issues:
        print(f"  {i['name']}: {int(i['current'])} -> {i['recommended']} {i['label']}")
        print(f"  https://console.aws.amazon.com/servicequotas/home/services/bedrock/quotas/{i['code']}")
    print("\nAWS CLI (requires servicequotas:RequestServiceQuotaIncrease):")
    for i in issues:
        if i['adjustable']:
            print(f"  aws service-quotas request-service-quota-increase \\")
            print(f"    --service-code bedrock --quota-code {i['code']} \\")
            print(f"    --desired-value {i['recommended']} --region {region}")
    print(f"\nProvisioned Throughput (immediate, billed per minute):")
    print(f"  https://{region}.console.aws.amazon.com/bedrock/home#/provisioned-throughput")
PYEOF
EOF
chmod +x ./check-bedrock-quotas.sh
AWS_PROFILE=prod-diagnostics AWS_DEFAULT_REGION=ap-southeast-1 ./check-bedrock-quotas.sh

Appendix B: Reference Links

The scripts and analysis prompts in this guide draw on the following AWS documentation, engineering resources, and community research. These links are useful for deeper reading on any specific diagnostic area and for understanding the AWS service limits, quota behaviours, and configuration parameters that the scripts surface.

ElastiCache Redis and DAX

OpenSearch Service

EBS Performance and IOPS

  • Amazon EBS volume types covers gp2, gp3, io1, io2, st1, and sc1 performance specifications including IOPS ceilings and throughput limits per volume type.
  • Amazon EBS-optimized instances lists the baseline and maximum EBS bandwidth, IOPS, and throughput for every EC2 instance type, which is required to identify the instance-level bottleneck when volume IOPS appear sufficient.
  • Troubleshoot Amazon EBS performance issues on EC2 describes the InstanceEBSIOPSExceededCheck, InstanceEBSThroughputExceededCheck, VolumeIOPSExceededCheck, and BurstBalance CloudWatch metrics that the storage IOPS script collects.

VPC Flow Logs and TCP Analysis

DNS, CoreDNS, and EKS Resolution

Aurora PostgreSQL Query Plan Management

CloudTrail, Logging, and Observability

Bedrock and Service Quotas