Why do most cloud governance programmes fail over time?

Most cloud governance programmes start with a handful of useful controls but gradually accumulate tagging standards, cost optimisation checks, benchmark scanning, and configuration standards. Over time the report contains thousands of findings across hundreds of categories, becoming impossible to act on. The core problem is that teams end up scoring hygiene noise instead of focusing on risks that could seriously harm the business.

How fast can an AWS cloud intrusion escalate to full compromise?

According to Sysdig's Threat Research Team, which documented an AWS intrusion in November 2025, an attacker moved from a single exposed credential to full administrative control in under ten minutes, crossing nineteen distinct principals and abusing Amazon Bedrock for compute. This demonstrates that the time dimension is now the most critical factor in cloud security risk.

What four questions should a board be asking about cloud risk?

Rather than asking whether checkboxes are ticked, boards should ask four questions in sequence: what the current state of the environment is, what an attacker or a mistake can do from that state (the abuse path), what the business loses if that happens (the business loss), and how long the organisation would take to detect, contain, and recover (the recovery feasibility). The article frames this as a progression from current state to abuse path to business loss to recovery feasibility.

What is the key shift in mindset recommended for cloud risk programmes?

The key shift is to stop auditing configurations as isolated findings and start measuring catastrophic loss pathways. Specifically, teams should measure what an attacker can do and how long recovery would take, rather than whether a best-practice checkbox is ticked. An unlocked door in an empty room is a hygiene issue, but an unlocked door that allows deletion of every backup within fifteen minutes with no recovery path is a material risk, and the two should never appear at the same severity level.

Stop Trying to Audit AWS Cloud Risks and Start Measuring Catastrophic Loss Pathways

07 Jun 2026 Public Cloud

AWS Loss Pathways: Measure Blast Radius, Not Audit Noise

What is the difference between measuring cloud exposure and measuring loss pathways?

Measuring exposure answers whether a configuration violates a best practice, while measuring loss pathways answers whether a misconfiguration could lead to fraud, ransomware, a material outage, a regulatory breach, irreversible deletion, customer harm, or business shutdown. A configuration finding tells you a door is unlocked; a loss pathway tells you what is behind the door, how fast someone can walk through it, and whether you can ever recover what is taken.


Measuring catastrophic loss pathways means tracing the specific sequences of misconfiguration, exposed credential, lateral movement, and data exfiltration that could produce a company-ending event, rather than aggregating thousands of low-signal findings into a compliance score. Prioritising these pathways forces teams to ask which combinations of risk, not which individual controls, create existential exposure.

Article Summary

1. What it is: AWS cloud risk programmes fail because they measure exposure rather than loss pathways. This article shows how to reframe detection around four questions: current state, abuse path, business loss, and recovery feasibility.
2. Why it matters: A finding like 'deletion protection disabled' is unactionable. Reframing it as 'a single compromised role can delete the production payments database with a four day recovery window' gives executives something they can fund and be accountable for.
3. Key takeaway: Sysdig documented a 2025 AWS intrusion where one exposed credential reached full administrative control across nineteen principals in under ten minutes, proving that recovery time and attack speed are now the metrics that matter, not configuration scores.


Most cloud governance programmes begin with good intentions and eventually collapse under their own weight. The team starts with a handful of useful controls, someone adds tagging standards, another team adds cost optimisation, security introduces benchmark scanning, and platform engineering introduces configuration standards. Before long the report contains thousands of findings spread across hundreds of categories, and what initially felt like maturity eventually becomes impossible to act on.

That much is a familiar problem, and the first answer to it is to stop scoring lint and start scoring the things that can kill the company. But there is a second, subtler problem hiding underneath the first, and it is the one that most risk programmes never escape. Even after you strip out the hygiene noise, you are usually still measuring exposure rather than loss. You are answering the question “does this violate a best practice” when the question that actually matters to a board is “could this lead to fraud, ransomware, a material outage, a regulatory breach, irreversible deletion, customer harm, or business shutdown.” Those are not the same question, and the gap between them is where companies die.

A configuration finding tells you a door is unlocked. A loss pathway tells you what is behind the door, how fast someone can walk through it, and whether you can ever get back what they take. The evolution this post argues for is the move from current state to abuse path to business loss to recovery feasibility. An unlocked door in an empty room is hygiene. An unlocked door that leads to the ability to delete every backup in under fifteen minutes with no recovery path is a material risk, and the two should never again appear in the same report at the same severity. Sysdig’s Threat Research Team documented an AWS intrusion in November 2025 in which an attacker went from a single exposed credential to full administrative control in under ten minutes, moving across nineteen distinct principals and abusing Bedrock for compute. It is the clearest recent demonstration that time is now the dimension that matters most.

This post keeps the per service detectors that find dangerous state, because you cannot reason about loss pathways without first knowing the state of the environment, but it reframes the whole exercise around the four questions a board actually asks, and it adds the script families that measure abuse paths, blast radius, and recovery feasibility directly. The shift in framing is simple to state and hard to live by: measure what an attacker can do and how long you would take to recover, not whether a checkbox is ticked.

1. The Design Principles for a Loss Pathway Detector

Traditional audit approaches depend heavily on human responses. Teams are asked whether deletion protection exists, whether backups are configured, whether production is recoverable, and those answers have very little value because cloud platforms already know the truth. A better approach is to build a collection of scripts that interrogate AWS directly and produce findings, but the design principle that separates a loss pathway detector from a hygiene scanner is the question each check is required to answer.

A hygiene check answers “does this violate a best practice?” A loss pathway check answers four questions in sequence. What is the current state? What can an attacker or a mistake do from that state (the abuse path)? What does the business lose if they do it (the business loss)? And how long would the organisation take to detect, contain, and recover (the recovery feasibility)? A finding that cannot articulate all four is not yet a material risk finding; it is a configuration observation waiting to be promoted or discarded. The 2025 AWS attack patterns documented by Sysdig and the Software Secured analysis of AWS privilege escalation both make the same point from different angles: the dangerous thing is rarely the misconfiguration itself but the chain of valid actions that the misconfiguration makes possible.

Each detector still operates in read only mode, produces structured output, and includes evidence for every finding, but the output model grows to carry the loss pathway dimensions so that the report speaks in the language a board understands rather than the language a linter produces:

severity | risk_type | account | region | service | resource | finding | evidence | abuse_path | business_loss | detect_mins | contain_mins | recover_mins | recommendation

A populated finding therefore reads as a complete story rather than a fragment. Instead of “deletion protection disabled” it reads “a single compromised deployment role can delete the production payments database, the business loses all transaction processing, detection takes roughly two hours, containment twelve, and full recovery four days.” That sentence is something an executive can act on, fund, and be accountable for, which is the entire point of measuring loss pathways rather than exposure.
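As a rough illustration of that output model, the helper below assembles one 14-field record. The function name `emit_loss_finding`, the account number, and the sample values are hypothetical. The minute values restate the example sentence above (two hours, twelve hours, four days), assuming "containment twelve" means twelve hours.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one finding in the 14-field, pipe-delimited format.
# Argument order follows the output model: severity, risk_type, account, region,
# service, resource, finding, evidence, abuse_path, business_loss,
# detect_mins, contain_mins, recover_mins, recommendation.
emit_loss_finding() {
  if [ "$#" -ne 14 ]; then
    echo "emit_loss_finding: expected 14 fields, got $#" >&2
    return 1
  fi

  local field
  local -a fields=()
  for field in "$@"; do
    # Replace any pipe inside a value with a slash so the record stays parseable
    fields+=("${field//|//}")
  done

  # Join the sanitised fields with the pipe delimiter
  local IFS='|'
  printf '%s\n' "${fields[*]}"
}

# Example call with illustrative values only
emit_loss_finding "CRITICAL" "DATA_LOSS" "111122223333" "eu-west-1" "rds" \
  "payments-db" \
  "Deletion protection disabled" \
  "DeletionProtection=false" \
  "A single compromised deployment role can delete the database" \
  "All transaction processing stops" \
  "120" "720" "5760" \
  "Enable deletion protection"
```

Rejecting calls with the wrong number of arguments is a deliberate choice in this sketch: a record with a missing dimension fails loudly instead of silently shifting columns in the aggregated report.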

The purpose of the report is not to explain architecture or to prove the environment is clean. It is to expose what the business cannot survive losing and to prove, with evidence drawn directly from the platform, how long recovery would take. Every finding that does not serve that purpose should be removed rather than filed under a lower severity.
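One way to enforce that rule mechanically is to classify each record before it reaches the report. The sketch below assumes the 14-field pipe-delimited format described in this post. The function name `classify_finding` and the labels `MATERIAL` and `OBSERVATION` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical filter: prints MATERIAL when a record carries all four loss
# pathway dimensions (current state, abuse path, business loss, recovery
# feasibility), otherwise OBSERVATION.
classify_finding() {
  local line="$1" i
  local -a f
  IFS='|' read -r -a f <<< "$line"

  # Fields 8-10 (indices 7-9): evidence of current state, abuse_path, business_loss
  for i in 7 8 9; do
    if [ -z "${f[$i]:-}" ]; then
      echo "OBSERVATION"
      return 0
    fi
  done

  # Fields 11-13 (indices 10-12): detect, contain and recover minutes
  # must all be whole numbers
  for i in 10 11 12; do
    if ! [[ "${f[$i]:-}" =~ ^[0-9]+$ ]]; then
      echo "OBSERVATION"
      return 0
    fi
  done

  echo "MATERIAL"
}
```

A record labelled `OBSERVATION` would then be either enriched with the missing dimensions or dropped, rather than carried into the report at a lower severity.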

The detector is organised into the original per service family, which establishes current state, and a set of loss pathway families that establish abuse paths and recovery feasibility. Each script owns one domain and emits findings independently into a shared findings directory, with an aggregation layer combining the results and computing a score that multiplies blast radius by time:

risk-detector/
├── run-all.sh
├── score.py
├── risk-service/              # current state: dangerous configuration
│   ├── ec2-risk.sh
│   ├── rds-risk.sh
│   ├── s3-risk.sh
│   ├── eks-risk.sh
│   ├── route53-risk.sh
│   ├── kms-risk.sh
│   ├── backup-risk.sh
│   ├── iam-risk.sh
│   ├── cloudtrail-risk.sh
│   ├── vpc-risk.sh
│   ├── secrets-risk.sh
│   ├── guardduty-risk.sh
│   └── organisations-risk.sh
├── risk-identity/             # identity blast radius and takeover paths
│   └── identity-blast-radius.sh
├── risk-destructive/          # can an attacker destroy prod, and how fast
│   └── destructive-capability.sh
├── risk-observability/        # can an attacker go invisible
│   └── detection-blindness.sh
├── risk-data/                 # irrecoverable data and encryption hostage
│   └── data-destruction.sh
├── risk-supply-chain/         # CI/CD and pipeline compromise impact
│   └── pipeline-compromise.sh
├── risk-economic/             # denial of wallet and cloud bankruptcy
│   └── economic-dos.sh
├── risk-integrity/            # fraud and silent manipulation
│   └── integrity-attack.sh
├── risk-dependency/           # concentration and single points of failure
│   └── dependency-concentration.sh
├── risk-catastrophic/         # company ending binary tests
│   └── catastrophic-events.sh
└── findings/
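
The scoring formula itself is not defined in this part of the post, so the following is only a guess at its shape, written in bash for consistency with the detectors rather than as a stand-in for `score.py`. It assumes the 14-field record format, uses a hypothetical severity weight as a crude proxy for blast radius, and takes time as the sum of the detect, contain, and recover minutes.

```shell
#!/usr/bin/env bash
# Hypothetical weights: a crude proxy for blast radius, not taken from the post
severity_weight() {
  case "$1" in
    CRITICAL) echo 4 ;;
    HIGH)     echo 3 ;;
    MEDIUM)   echo 2 ;;
    *)        echo 1 ;;
  esac
}

# Prints weight * (detect_mins + contain_mins + recover_mins) for one record.
# Assumes the three minute fields are whole decimal numbers without leading
# zeros; empty or missing fields count as zero.
score_finding() {
  local -a f
  IFS='|' read -r -a f <<< "$1"
  local total=$(( ${f[10]:-0} + ${f[11]:-0} + ${f[12]:-0} ))
  local weight
  weight=$(severity_weight "${f[0]:-}")
  echo $(( weight * total ))
}
```

Sorting findings by this number in descending order would put slow-to-recover, high-severity pathways at the top of the report, which is the ordering the post argues for.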

2. EC2 Risk Detection

Compute itself rarely creates permanent damage. Exposure, deletion and failed recovery do. The EC2 script should focus on identifying conditions that could expose systems, destroy compute unexpectedly, or make recovery impossible. Qualys’s analysis of a 2025 automotive data leak traced the root cause directly to overly broad IAM permissions and hardcoded credentials, demonstrating that compute-level exposure routinely begins with a network or identity misconfiguration rather than a flaw in the application itself.

The first class of findings focuses on network exposure. Security groups should be reported whenever a CIDR broader than /24 is allowed to reach sensitive administrative or data ports. The sensitive port list should include SSH on 22, RDP on 3389, SQL Server on 1433, MySQL on 3306, PostgreSQL on 5432, Redis on 6379, MongoDB on 27017, Elasticsearch on 9200 and 9300, and the Kubernetes API server on 6443. The report should not complain about broad CIDRs in isolation; it should only trigger when broad access intersects with a sensitive capability.
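
The intersection rule can be isolated as a few pure functions that need no AWS access, which makes it easy to test. This is a minimal sketch: the function names are hypothetical, the port list is the one given above, and only IPv4 CIDRs are handled.

```shell
#!/usr/bin/env bash
SENSITIVE_PORTS=(22 3389 1433 3306 5432 6379 27017 9200 9300 6443)

# Succeeds when the IPv4 CIDR is broader than /24 (prefix length 0 to 23)
is_broad_cidr() {
  local cidr="$1" prefix
  [[ "$cidr" == */* ]] || return 1
  prefix="${cidr##*/}"
  [[ "$prefix" =~ ^[0-9]+$ ]] || return 1
  [ "$prefix" -lt 24 ]
}

# Prints every sensitive port inside the inclusive range from..to
sensitive_ports_in_range() {
  local from="$1" to="$2" port
  for port in "${SENSITIVE_PORTS[@]}"; do
    if [ "$from" -le "$port" ] && [ "$to" -ge "$port" ]; then
      echo "$port"
    fi
  done
}

# Succeeds only when broad access intersects with a sensitive port
should_report() {
  local cidr="$1" from="$2" to="$3"
  is_broad_cidr "$cidr" && [ -n "$(sensitive_ports_in_range "$from" "$to")" ]
}
```

Under this rule `should_report "0.0.0.0/0" 443 443` fails, because a broad CIDR on a non-sensitive port is not reported, while `should_report "0.0.0.0/0" 22 22` succeeds.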

The second class of findings focuses on recovery. Production instances should be reported when termination protection is disabled, when attached EBS volumes are configured to delete automatically on instance termination, or when no recent recovery artefacts exist.

cat > risk-service/ec2-risk.sh << 'EOF'
#!/usr/bin/env bash
# ec2-risk.sh - Material EC2 risk detection
# Requires: aws cli v2, jq
# Usage: AWS_DEFAULT_REGION=eu-west-1 AWS_PROFILE=myprofile ./ec2-risk.sh
# (region and profile come from the environment; the script parses no flags)

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/ec2-findings.txt"
mkdir -p findings

SENSITIVE_PORTS=(22 3389 1433 3306 5432 6379 27017 9200 9300 6443)

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|ec2|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== EC2 Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. Security groups with broad CIDR access to sensitive ports
# -----------------------------------------------------------------------
echo "[*] Checking security groups for broad access to sensitive ports..."

aws ec2 describe-security-groups \
  --query 'SecurityGroups[*].{GroupId:GroupId,GroupName:GroupName,Rules:IpPermissions}' \
  --output json | jq -c '.[]' | while IFS= read -r sg; do
    sg_id=$(echo "$sg" | jq -r '.GroupId')
    sg_name=$(echo "$sg" | jq -r '.GroupName')

    echo "$sg" | jq -c '.Rules[]?' | while IFS= read -r rule; do
      from_port=$(echo "$rule" | jq -r '.FromPort // 0')
      to_port=$(echo "$rule" | jq -r '.ToPort // 65535')

      for port in "${SENSITIVE_PORTS[@]}"; do
        if [ "$from_port" -le "$port" ] && [ "$to_port" -ge "$port" ] 2>/dev/null; then

          # Check IPv4 broad CIDRs (prefix shorter than /24, i.e. /0 to /23)
          echo "$rule" | jq -r '.IpRanges[]?.CidrIp // empty' | while IFS= read -r cidr; do
            prefix=$(echo "$cidr" | cut -d'/' -f2)
            if [ -n "$prefix" ] && [ "$prefix" -lt 24 ] 2>/dev/null; then
              emit_finding "HIGH" "SECURITY_EXPOSURE" "${sg_id}(${sg_name})" \
                "Broad CIDR ${cidr} permits access to sensitive port ${port}" \
                "FromPort=${from_port},ToPort=${to_port},CIDR=${cidr}" \
                "Restrict to known IP ranges or use VPN/bastion access"
            fi
          done

          # Check IPv6 broad CIDRs
          echo "$rule" | jq -r '.Ipv6Ranges[]?.CidrIpv6 // empty' | while IFS= read -r cidr6; do
            if [ "$cidr6" = "::/0" ]; then
              emit_finding "HIGH" "SECURITY_EXPOSURE" "${sg_id}(${sg_name})" \
                "IPv6 open access (::/0) permits access to sensitive port ${port}" \
                "FromPort=${from_port},ToPort=${to_port},CIDRv6=${cidr6}" \
                "Restrict IPv6 access to known prefixes or remove rule"
            fi
          done
        fi
      done
    done
  done

# -----------------------------------------------------------------------
# 2. Production instances without termination protection
# -----------------------------------------------------------------------
echo "[*] Checking instances for missing termination protection..."

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running,stopped" \
  --query 'Reservations[*].Instances[*].{Id:InstanceId,Tags:Tags}' \
  --output json | jq -c '.[][]' | while IFS= read -r instance; do
    instance_id=$(echo "$instance" | jq -r '.Id')

    # Check if tagged as production (prod, production, prd in Name or Environment tags)
    is_prod=$(echo "$instance" | jq -r '
      .Tags // [] |
      map(select(.Key == "Environment" or .Key == "Name")) |
      map(.Value | ascii_downcase) |
      map(select(test("prod|prd|production"))) |
      length > 0
    ')

    if [ "$is_prod" = "true" ]; then
      # Use JSON output: with --output text the CLI prints booleans as True/False,
      # which would never equal the lowercase "true" compared below
      protection=$(aws ec2 describe-instance-attribute \
        --instance-id "$instance_id" \
        --attribute disableApiTermination \
        --query 'DisableApiTermination.Value' \
        --output json 2>/dev/null || echo "false")

      if [ "$protection" != "true" ]; then
        emit_finding "CRITICAL" "ACCIDENTAL_DELETION" "$instance_id" \
          "Production instance has termination protection disabled" \
          "DisableApiTermination=${protection}" \
          "Enable termination protection: aws ec2 modify-instance-attribute --instance-id ${instance_id} --disable-api-termination"
      fi
    fi
  done

# -----------------------------------------------------------------------
# 3. EBS volumes configured to auto-delete on instance termination
# -----------------------------------------------------------------------
echo "[*] Checking EBS volume delete-on-termination for running instances..."

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[*].{Id:InstanceId,Tags:Tags,Mappings:BlockDeviceMappings}' \
  --output json | jq -c '.[][]' | while IFS= read -r instance; do
    instance_id=$(echo "$instance" | jq -r '.Id')

    is_prod=$(echo "$instance" | jq -r '
      .Tags // [] |
      map(select(.Key == "Environment" or .Key == "Name")) |
      map(.Value | ascii_downcase) |
      map(select(test("prod|prd|production"))) |
      length > 0
    ')

    if [ "$is_prod" = "true" ]; then
      echo "$instance" | jq -c '.Mappings[]?' | while IFS= read -r mapping; do
        device=$(echo "$mapping" | jq -r '.DeviceName')
        delete_on_term=$(echo "$mapping" | jq -r '.Ebs.DeleteOnTermination // false')
        volume_id=$(echo "$mapping" | jq -r '.Ebs.VolumeId // "unknown"')

        if [ "$delete_on_term" = "true" ]; then
          emit_finding "HIGH" "DATA_LOSS" "$instance_id" \
            "EBS volume ${volume_id} on device ${device} is configured to delete on instance termination" \
            "DeleteOnTermination=true,Device=${device},VolumeId=${volume_id}" \
            "Set DeleteOnTermination=false for production data volumes"
        fi
      done
    fi
  done

echo "[*] EC2 risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/ec2-risk.sh
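
The script treats any Name or Environment tag value containing `prod` or `prd` as production, which also matches values such as `nonprod` or `product-catalog`. If that produces false positives in a given estate, a stricter matcher could look like the sketch below. The function name `is_prod_tag_value` and the list of excluded prefixes are assumptions, not part of the original detector.

```shell
#!/usr/bin/env bash
# Hypothetical matcher: succeeds only when prod, prd or production appears as
# a whole token, and the value is not a non-prod or pre-prod spelling.
is_prod_tag_value() {
  local value
  value=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')

  # Spellings such as nonprod, non-prod, pre_prod are rejected first
  local reject_re='(^|[^a-z0-9])(non|pre)[-_]?(prod|prd)'
  # A token is bounded by the string edges or any non-alphanumeric character
  local accept_re='(^|[^a-z0-9])(prod|prd|production)($|[^a-z0-9])'

  if [[ "$value" =~ $reject_re ]]; then
    return 1
  fi
  [[ "$value" =~ $accept_re ]]
}
```

Token matching trades recall for precision: a value like `payments-prod-01` still matches, but a name such as `prodweb01` no longer does, so the choice depends on how consistently the estate is tagged.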

3. RDS and Aurora Risk Detection

Databases require stricter rules than compute because mistakes become persistent immediately. A misconfigured or deleted database cannot always be recovered to the same point, and the gap between the failure and the recovery is often measured in hours of lost transactions.

The database script focuses on preventing irreversible data loss and reducing exposure. Critical findings include deletion protection disabled, backup retention set to zero, public accessibility enabled, missing encryption, and snapshots that cannot survive account failure. Storage tuning and sizing recommendations remain outside scope because they do not directly increase material business risk.
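
The backup retention rule used by the script that follows can be stated as a small pure function. This sketch mirrors the instance-level thresholds in that script (zero days is critical, fewer than seven is high). The function name `retention_severity` and the `OK` and `UNKNOWN` labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical classifier for BackupRetentionPeriod, in days
retention_severity() {
  local days="$1"

  # Anything that is not a whole number is reported as UNKNOWN rather than guessed
  if ! [[ "$days" =~ ^[0-9]+$ ]]; then
    echo "UNKNOWN"
    return 0
  fi

  if [ "$days" -eq 0 ]; then
    echo "CRITICAL"   # automated backups off, no point-in-time recovery
  elif [ "$days" -lt 7 ]; then
    echo "HIGH"       # recovery window shorter than a week
  else
    echo "OK"
  fi
}
```

Keeping the thresholds in one function would also let the instance and Aurora cluster checks share the same rule instead of repeating the comparison.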

cat > risk-service/rds-risk.sh << 'EOF'
#!/usr/bin/env bash
# rds-risk.sh - Material RDS and Aurora risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/rds-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|rds|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== RDS Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. RDS instances - deletion protection, backup retention, public access, encryption
# -----------------------------------------------------------------------
echo "[*] Scanning RDS DB instances..."

aws rds describe-db-instances \
  --query 'DBInstances[*]' \
  --output json | jq -c '.[]' | while IFS= read -r db; do
    db_id=$(echo "$db" | jq -r '.DBInstanceIdentifier')
    deletion_protection=$(echo "$db" | jq -r '.DeletionProtection')
    backup_retention=$(echo "$db" | jq -r '.BackupRetentionPeriod')
    publicly_accessible=$(echo "$db" | jq -r '.PubliclyAccessible')
    storage_encrypted=$(echo "$db" | jq -r '.StorageEncrypted')
    multi_az=$(echo "$db" | jq -r '.MultiAZ')
    engine=$(echo "$db" | jq -r '.Engine')
    status=$(echo "$db" | jq -r '.DBInstanceStatus')

    # Only evaluate available instances
    [ "$status" != "available" ] && continue

    if [ "$deletion_protection" = "false" ]; then
      emit_finding "CRITICAL" "DATA_LOSS" "$db_id" \
        "Deletion protection is disabled on RDS instance" \
        "DeletionProtection=false,Engine=${engine}" \
        "Enable deletion protection: aws rds modify-db-instance --db-instance-identifier ${db_id} --deletion-protection --apply-immediately"
    fi

    if [ "$backup_retention" = "0" ]; then
      emit_finding "CRITICAL" "DATA_LOSS" "$db_id" \
        "Automated backup retention is set to zero days - point-in-time recovery is unavailable" \
        "BackupRetentionPeriod=0,Engine=${engine}" \
        "Set backup retention to at least 7 days for production databases"
    elif [ "$backup_retention" -lt 7 ] 2>/dev/null; then
      emit_finding "HIGH" "DATA_LOSS" "$db_id" \
        "Automated backup retention is only ${backup_retention} days - recovery window is insufficient" \
        "BackupRetentionPeriod=${backup_retention},Engine=${engine}" \
        "Increase backup retention to a minimum of 7 days"
    fi

    if [ "$publicly_accessible" = "true" ]; then
      emit_finding "CRITICAL" "SECURITY_EXPOSURE" "$db_id" \
        "RDS instance is publicly accessible via the internet" \
        "PubliclyAccessible=true,Engine=${engine}" \
        "Disable public accessibility and place the instance in a private subnet"
    fi

    if [ "$storage_encrypted" = "false" ]; then
      emit_finding "HIGH" "SECURITY_EXPOSURE" "$db_id" \
        "RDS instance storage is not encrypted at rest" \
        "StorageEncrypted=false,Engine=${engine}" \
        "Migrate to an encrypted instance using a snapshot and restore workflow"
    fi
  done

# -----------------------------------------------------------------------
# 2. Aurora clusters - deletion protection, backup retention, encryption
# -----------------------------------------------------------------------
echo "[*] Scanning Aurora clusters..."

aws rds describe-db-clusters \
  --query 'DBClusters[*]' \
  --output json | jq -c '.[]' | while IFS= read -r cluster; do
    cluster_id=$(echo "$cluster" | jq -r '.DBClusterIdentifier')
    deletion_protection=$(echo "$cluster" | jq -r '.DeletionProtection')
    backup_retention=$(echo "$cluster" | jq -r '.BackupRetentionPeriod')
    storage_encrypted=$(echo "$cluster" | jq -r '.StorageEncrypted')
    engine=$(echo "$cluster" | jq -r '.Engine')
    status=$(echo "$cluster" | jq -r '.Status')

    [ "$status" != "available" ] && continue

    if [ "$deletion_protection" = "false" ]; then
      emit_finding "CRITICAL" "DATA_LOSS" "$cluster_id" \
        "Deletion protection is disabled on Aurora cluster" \
        "DeletionProtection=false,Engine=${engine}" \
        "Enable deletion protection: aws rds modify-db-cluster --db-cluster-identifier ${cluster_id} --deletion-protection --apply-immediately"
    fi

    if [ "$backup_retention" -lt 7 ] 2>/dev/null; then
      emit_finding "HIGH" "DATA_LOSS" "$cluster_id" \
        "Aurora cluster backup retention is only ${backup_retention} days" \
        "BackupRetentionPeriod=${backup_retention},Engine=${engine}" \
        "Increase backup retention to a minimum of 7 days"
    fi

    if [ "$storage_encrypted" = "false" ]; then
      emit_finding "HIGH" "SECURITY_EXPOSURE" "$cluster_id" \
        "Aurora cluster storage is not encrypted at rest" \
        "StorageEncrypted=false,Engine=${engine}" \
        "Restore from snapshot to a new encrypted cluster"
    fi
  done

# -----------------------------------------------------------------------
# 3. Check for RDS snapshots accessible to other accounts
# -----------------------------------------------------------------------
echo "[*] Checking for publicly shared RDS snapshots..."

aws rds describe-db-snapshots \
  --snapshot-type manual \
  --query 'DBSnapshots[*].DBSnapshotIdentifier' \
  --output json | jq -r '.[]' | while IFS= read -r snapshot_id; do
    restore_attrs=$(aws rds describe-db-snapshot-attributes \
      --db-snapshot-identifier "$snapshot_id" \
      --query 'DBSnapshotAttributesResult.DBSnapshotAttributes' \
      --output json 2>/dev/null || echo '[]')

    public=$(echo "$restore_attrs" | jq -r '
      .[] | select(.AttributeName == "restore") |
      .AttributeValues | map(select(. == "all")) | length > 0
    ')

    if [ "$public" = "true" ]; then
      emit_finding "CRITICAL" "SECURITY_EXPOSURE" "$snapshot_id" \
        "RDS snapshot is publicly shared and accessible to all AWS accounts" \
        "AttributeValues=[all]" \
        "Remove public restore access: aws rds modify-db-snapshot-attribute --db-snapshot-identifier ${snapshot_id} --attribute-name restore --values-to-remove all"
    fi
  done

echo "[*] RDS risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/rds-risk.sh

4. S3 Risk Detection

Object storage often becomes the foundation of backup, audit, analytics and recovery pipelines, which means mistakes in S3 frequently become system wide failures rather than isolated incidents. A publicly exposed backup bucket, an aggressively pruned versioning policy, or a lifecycle rule that silently removes current versions can propagate damage far beyond the bucket itself.

The S3 detector reports buckets where public access blocks are disabled, bucket policies expose data publicly, versioning is disabled for backup or logging buckets, object lock is missing for immutable storage use cases, and lifecycle rules aggressively remove current or historical versions. Replication configuration is also analysed because some environments accidentally replicate deletions into recovery environments.

cat > risk-service/s3-risk.sh << 'EOF'
#!/usr/bin/env bash
# s3-risk.sh - Material S3 risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/s3-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|s3|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== S3 Material Risk Scan: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"

buckets=$(aws s3api list-buckets --query 'Buckets[*].Name' --output json | jq -r '.[]')

for bucket in $buckets; do
  echo "[*] Scanning bucket: ${bucket}"

  # -----------------------------------------------------------------------
  # 1. Public access block configuration
  # -----------------------------------------------------------------------
  public_access=$(aws s3api get-public-access-block \
    --bucket "$bucket" \
    --output json 2>/dev/null || echo '{}')

  block_all=$(echo "$public_access" | jq -r '
    .PublicAccessBlockConfiguration |
    if . == null then false
    else (.BlockPublicAcls and .IgnorePublicAcls and .BlockPublicPolicy and .RestrictPublicBuckets)
    end
  ')

  if [ "$block_all" != "true" ]; then
    emit_finding "HIGH" "SECURITY_EXPOSURE" "$bucket" \
      "S3 bucket does not have all public access block settings enabled" \
      "BlockPublicAcls=$(echo "$public_access" | jq -r '.PublicAccessBlockConfiguration.BlockPublicAcls // false')" \
      "Enable all four public access block settings via aws s3api put-public-access-block"
  fi

  # -----------------------------------------------------------------------
  # 2. Bucket policy exposing data publicly
  # -----------------------------------------------------------------------
  policy=$(aws s3api get-bucket-policy --bucket "$bucket" --output text 2>/dev/null || echo "")

  if [ -n "$policy" ]; then
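    # Statement may be a single object or an array, and Principal.AWS may be
    # a string or a list, so normalise both before looking for the wildcard
    public_principal=$(echo "$policy" | jq -r '
      [.Statement] | flatten | .[] |
      select(.Effect == "Allow") |
      select(
        .Principal == "*" or
        ((.Principal | type) == "object" and
         ([.Principal.AWS // empty] | flatten | any(. == "*")))
      ) |
      .Principal
    ' 2>/dev/null || echo "")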

    if [ -n "$public_principal" ]; then
      emit_finding "CRITICAL" "SECURITY_EXPOSURE" "$bucket" \
        "Bucket policy grants Allow access to Principal=* (the entire internet)" \
        "Principal=*,Effect=Allow" \
        "Review and tighten bucket policy to restrict access to specific AWS principals"
    fi
  fi

  # -----------------------------------------------------------------------
  # 3. Versioning check for backup/logging/audit buckets
  # -----------------------------------------------------------------------
  name_lower=$(echo "$bucket" | tr '[:upper:]' '[:lower:]')
  is_sensitive=false
  for keyword in backup log audit archive restore recovery; do
    if echo "$name_lower" | grep -q "$keyword"; then
      is_sensitive=true
      break
    fi
  done

  if [ "$is_sensitive" = "true" ]; then
    versioning=$(aws s3api get-bucket-versioning \
      --bucket "$bucket" \
      --query 'Status' \
      --output text 2>/dev/null || echo "None")

    if [ "$versioning" != "Enabled" ]; then
      emit_finding "HIGH" "DATA_LOSS" "$bucket" \
        "Bucket with sensitive naming pattern has versioning disabled - overwrites and deletes are irreversible" \
        "VersioningStatus=${versioning}" \
        "Enable versioning: aws s3api put-bucket-versioning --bucket ${bucket} --versioning-configuration Status=Enabled"
    fi
  fi

  # -----------------------------------------------------------------------
  # 4. Lifecycle rules that aggressively expire current versions
  # -----------------------------------------------------------------------
  lifecycle=$(aws s3api get-bucket-lifecycle-configuration \
    --bucket "$bucket" \
    --output json 2>/dev/null || echo '{"Rules":[]}')

  echo "$lifecycle" | jq -c '.Rules[]?' | while IFS= read -r rule; do
    rule_id=$(echo "$rule" | jq -r '.ID // "unnamed"')
    expiry_days=$(echo "$rule" | jq -r '.Expiration.Days // 0')
    noncurrent_days=$(echo "$rule" | jq -r '.NoncurrentVersionExpiration.NoncurrentDays // 0')

    if [ "$expiry_days" -gt 0 ] && [ "$expiry_days" -lt 30 ] 2>/dev/null; then
      emit_finding "HIGH" "DATA_LOSS" "$bucket" \
        "Lifecycle rule '${rule_id}' expires current object versions after only ${expiry_days} days" \
        "Rule=${rule_id},ExpirationDays=${expiry_days}" \
        "Review lifecycle rule to ensure critical data is not being prematurely deleted"
    fi

    if [ "$noncurrent_days" -gt 0 ] && [ "$noncurrent_days" -lt 7 ] 2>/dev/null; then
      emit_finding "MEDIUM" "DATA_LOSS" "$bucket" \
        "Lifecycle rule '${rule_id}' deletes noncurrent versions after only ${noncurrent_days} days - recovery window is very narrow" \
        "Rule=${rule_id},NoncurrentDays=${noncurrent_days}" \
        "Increase noncurrent version retention to at least 30 days"
    fi
  done

  # -----------------------------------------------------------------------
  # 5. Replication that propagates deletes to recovery buckets
  # -----------------------------------------------------------------------
  replication=$(aws s3api get-bucket-replication \
    --bucket "$bucket" \
    --output json 2>/dev/null || echo '{}')

  if [ "$(echo "$replication" | jq 'has("ReplicationConfiguration")')" = "true" ]; then
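    # Collect every matching rule ID on one line so the pipe-delimited finding stays on a single row
    delete_marker_replication=$(echo "$replication" | jq -r '
      [ .ReplicationConfiguration.Rules[]? |
        select(.DeleteMarkerReplication.Status == "Enabled") |
        (.ID // "unnamed") ] | join(",")
    ')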

    if [ -n "$delete_marker_replication" ]; then
      emit_finding "HIGH" "DATA_LOSS" "$bucket" \
        "Replication rule '${delete_marker_replication}' is propagating delete markers to destination bucket - deletions will replicate" \
        "DeleteMarkerReplication=Enabled,Rule=${delete_marker_replication}" \
        "Disable delete marker replication unless the destination bucket is intended to mirror deletions"
    fi
  fi

done

echo "[*] S3 risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/s3-risk.sh

5. EKS Risk Detection

Kubernetes changes the scale of failure in a way that few other parts of an AWS estate do. One incorrect operation can remove hundreds of resources in seconds, and an automated reconciliation loop can amplify a configuration mistake across an entire cluster before any human can intervene.

The EKS detector concentrates on deletion, access and recovery. Production clusters should be reported when public endpoints are broadly accessible, control plane logs are disabled, node permissions are excessive, workloads can access infrastructure metadata unexpectedly, or persistent storage lacks recovery capability. GitOps platforms deserve special attention because automated reconciliation can amplify configuration mistakes, and the detector should specifically identify whether production namespaces can be pruned automatically.

cat > risk-service/eks-risk.sh << 'EOF'
#!/usr/bin/env bash
# eks-risk.sh - Material EKS risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/eks-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|eks|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== EKS Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

clusters=$(aws eks list-clusters --query 'clusters' --output json | jq -r '.[]')

for cluster in $clusters; do
  echo "[*] Scanning EKS cluster: ${cluster}"

  cluster_data=$(aws eks describe-cluster --name "$cluster" --output json)

  # -----------------------------------------------------------------------
  # 1. Public endpoint with no IP restrictions
  # -----------------------------------------------------------------------
  endpoint_public=$(echo "$cluster_data" | jq -r '.cluster.resourcesVpcConfig.endpointPublicAccess')
  public_cidrs=$(echo "$cluster_data" | jq -r '(.cluster.resourcesVpcConfig.publicAccessCidrs // []) | join(",")')

  if [ "$endpoint_public" = "true" ]; then
    if echo "$public_cidrs" | grep -qF "0.0.0.0/0"; then
      emit_finding "CRITICAL" "SECURITY_EXPOSURE" "$cluster" \
        "EKS cluster API endpoint is publicly accessible with no IP restrictions (0.0.0.0/0)" \
        "EndpointPublicAccess=true,PublicAccessCidrs=0.0.0.0/0" \
        "Restrict publicAccessCidrs to known IP ranges or disable public endpoint entirely"
    else
      emit_finding "MEDIUM" "SECURITY_EXPOSURE" "$cluster" \
        "EKS cluster API endpoint is publicly accessible but restricted to: ${public_cidrs}" \
        "EndpointPublicAccess=true,PublicAccessCidrs=${public_cidrs}" \
        "Verify that all listed CIDRs are intentional and necessary"
    fi
  fi

  # -----------------------------------------------------------------------
  # 2. Control plane logging disabled
  # -----------------------------------------------------------------------
  enabled_logs=$(echo "$cluster_data" | jq -r '
    .cluster.logging.clusterLogging[]? |
    select(.enabled == true) |
    .types[]
  ' | tr '\n' ',' | sed 's/,$//')

  required_logs=("api" "audit" "authenticator")
  for log_type in "${required_logs[@]}"; do
    if ! echo ",${enabled_logs}," | grep -qF ",${log_type},"; then
      emit_finding "HIGH" "REDUCED_VISIBILITY" "$cluster" \
        "EKS control plane log type '${log_type}' is not enabled - incident investigation will be impaired" \
        "EnabledLogTypes=${enabled_logs:-none}" \
        "Enable all control plane log types: api, audit, authenticator, controllerManager, scheduler"
    fi
  done

  # -----------------------------------------------------------------------
  # 3. Node groups with overly broad IAM policies
  # -----------------------------------------------------------------------
  node_groups=$(aws eks list-nodegroups \
    --cluster-name "$cluster" \
    --query 'nodegroups' \
    --output json | jq -r '.[]')

  for ng in $node_groups; do
    ng_data=$(aws eks describe-nodegroup \
      --cluster-name "$cluster" \
      --nodegroup-name "$ng" \
      --output json)

    node_role=$(echo "$ng_data" | jq -r '.nodegroup.nodeRole')
    # Take the segment after the last "/" so role ARNs that include a path resolve to the role name
    role_name="${node_role##*/}"

    # Check for attached admin or power user policies
    attached=$(aws iam list-attached-role-policies \
      --role-name "$role_name" \
      --query 'AttachedPolicies[*].PolicyArn' \
      --output json 2>/dev/null | jq -r '.[]')

    for policy_arn in $attached; do
      if echo "$policy_arn" | grep -qE "AdministratorAccess|PowerUserAccess"; then
        emit_finding "CRITICAL" "SECURITY_EXPOSURE" "${cluster}/${ng}" \
          "EKS node group uses IAM role with broad managed policy: ${policy_arn}" \
          "NodeRole=${role_name},PolicyArn=${policy_arn}" \
          "Replace with a least privilege policy containing only the permissions the nodes require"
      fi
    done
  done

  # -----------------------------------------------------------------------
  # 4. IMDS access not restricted on node groups (metadata service exposure)
  # -----------------------------------------------------------------------
  for ng in $node_groups; do
    ng_data=$(aws eks describe-nodegroup \
      --cluster-name "$cluster" \
      --nodegroup-name "$ng" \
      --output json)

    launch_template=$(echo "$ng_data" | jq -r '.nodegroup.launchTemplate.id // empty')

    if [ -n "$launch_template" ]; then
      lt_version=$(echo "$ng_data" | jq -r '.nodegroup.launchTemplate.version // "$Default"')
      imds_tokens=$(aws ec2 describe-launch-template-versions \
        --launch-template-id "$launch_template" \
        --versions "$lt_version" \
        --query 'LaunchTemplateVersions[0].LaunchTemplateData.MetadataOptions.HttpTokens' \
        --output text 2>/dev/null || echo "optional")

      if [ "$imds_tokens" != "required" ]; then
        emit_finding "HIGH" "SECURITY_EXPOSURE" "${cluster}/${ng}" \
          "EKS node group launch template does not require IMDSv2 - workloads can access instance metadata without token auth" \
          "HttpTokens=${imds_tokens}" \
          "Set HttpTokens=required in the launch template to enforce IMDSv2"
      fi
    fi
  done

done

echo "[*] EKS risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/eks-risk.sh
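
The script above does not cover the GitOps pruning check described earlier in this section. The following is a minimal sketch of one way to approach it, assuming Argo CD is the GitOps platform and that its Application manifests are available as local YAML files, one Application per file, in block style with space indentation. The function name find_prune_enabled and the directory layout are hypothetical, and an inline form such as "automated: {prune: true}" would not be detected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Argo CD Application manifests with automated pruning enabled.
# Assumes one Application per YAML file, block-style YAML and space indentation.

find_prune_enabled() {
  local dir="$1" file
  for file in "$dir"/*.yaml; do
    [ -f "$file" ] || continue
    # Track whether we are inside an "automated:" block and flag "prune: true" within it.
    # The block ends at the first non-blank line indented no deeper than "automated:".
    if awk '
      /^[ \t]*automated:/ { in_auto = 1; indent = match($0, /[^ ]/); next }
      in_auto && NF && match($0, /[^ ]/) <= indent { in_auto = 0 }
      in_auto && /^[ \t]*prune:[ \t]*true/ { found = 1 }
      END { exit(found ? 0 : 1) }
    ' "$file"; then
      echo "$file"
    fi
  done
}
```

Running find_prune_enabled against a directory of exported manifests prints the path of each Application that can prune automatically. Each path could then be passed to emit_finding for applications that target production namespaces.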

6. Route 53 and Edge Risk Detection

DNS failures frequently appear to users as complete platform outages even when every application server is healthy, because without working name resolution, traffic cannot reach the infrastructure at all. For this reason DNS should be treated as a critical control plane asset rather than a routine configuration item.

The detector identifies hosted zones that can be modified or deleted without restriction, missing DNS exports, load balancers without deletion protection, and environments without a repeatable recovery artefact. The objective is not to judge DNS design but to determine whether traffic can disappear.

cat > risk-service/route53-risk.sh << 'EOF'
#!/usr/bin/env bash
# route53-risk.sh - Material Route 53 and edge risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/route53-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|route53|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== Route 53 Material Risk Scan: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. Hosted zones without query logging enabled
# -----------------------------------------------------------------------
echo "[*] Checking hosted zones for query logging..."

zones=$(aws route53 list-hosted-zones --query 'HostedZones[*].{Id:Id,Name:Name,Private:Config.PrivateZone}' --output json)

echo "$zones" | jq -c '.[]' | while IFS= read -r zone; do
  zone_id=$(echo "$zone" | jq -r '.Id' | cut -d'/' -f3)
  zone_name=$(echo "$zone" | jq -r '.Name')
  is_private=$(echo "$zone" | jq -r '.Private')

  # Skip private zones for query logging check
  if [ "$is_private" = "false" ]; then
    logging_configs=$(aws route53 list-query-logging-configs \
      --hosted-zone-id "$zone_id" \
      --query 'QueryLoggingConfigs' \
      --output json 2>/dev/null | jq 'length')

    if [ "$logging_configs" = "0" ]; then
      emit_finding "MEDIUM" "REDUCED_VISIBILITY" "${zone_name}(${zone_id})" \
        "Public hosted zone has no query logging enabled - DNS resolution activity cannot be audited" \
        "QueryLoggingConfigs=0" \
        "Enable Route 53 query logging to CloudWatch Logs for this hosted zone"
    fi
  fi

  # -----------------------------------------------------------------------
  # 2. Zones with very few records (possible accidental deletion indicator)
  # -----------------------------------------------------------------------
  record_count=$(aws route53 list-resource-record-sets \
    --hosted-zone-id "$zone_id" \
    --query 'length(ResourceRecordSets)' \
    --output text 2>/dev/null || echo "0")

  if [ "$record_count" -lt 3 ] 2>/dev/null; then
    emit_finding "MEDIUM" "DATA_LOSS" "${zone_name}(${zone_id})" \
      "Hosted zone has only ${record_count} record(s) - zone may have had records accidentally deleted" \
      "RecordCount=${record_count}" \
      "Verify zone contents are complete and consider exporting a zone file as a recovery artefact"
  fi

done

# -----------------------------------------------------------------------
# 3. ALB / NLB load balancers without deletion protection
# -----------------------------------------------------------------------
echo "[*] Checking load balancers for deletion protection..."

aws elbv2 describe-load-balancers \
  --query 'LoadBalancers[*].LoadBalancerArn' \
  --output json | jq -r '.[]' | while IFS= read -r lb_arn; do
    # ARN resource part is loadbalancer/<type>/<name>/<id>, so the name is the third field
    lb_name=$(echo "$lb_arn" | awk -F':' '{print $NF}' | cut -d'/' -f3)

    deletion_protection=$(aws elbv2 describe-load-balancer-attributes \
      --load-balancer-arn "$lb_arn" \
      --query 'Attributes[?Key==`deletion_protection.enabled`].Value' \
      --output text 2>/dev/null || echo "false")

    if [ "$deletion_protection" != "true" ]; then
      emit_finding "HIGH" "ACCIDENTAL_DELETION" "$lb_name" \
        "Load balancer does not have deletion protection enabled" \
        "DeletionProtection=false,ARN=${lb_arn}" \
        "Enable deletion protection: aws elbv2 modify-load-balancer-attributes --load-balancer-arn ${lb_arn} --attributes Key=deletion_protection.enabled,Value=true"
    fi
  done

# -----------------------------------------------------------------------
# 4. DNSSEC not enabled on public hosted zones
# -----------------------------------------------------------------------
echo "[*] Checking DNSSEC status on public hosted zones..."

aws route53 list-hosted-zones \
  --query 'HostedZones[?Config.PrivateZone==`false`].{Id:Id,Name:Name}' \
  --output json | jq -c '.[]' | while IFS= read -r zone; do
    zone_id=$(echo "$zone" | jq -r '.Id' | cut -d'/' -f3)
    zone_name=$(echo "$zone" | jq -r '.Name')

    dnssec_status=$(aws route53 get-dnssec \
      --hosted-zone-id "$zone_id" \
      --query 'Status.ServeSignature' \
      --output text 2>/dev/null || echo "NOT_SIGNING")

    if [ "$dnssec_status" != "SIGNING" ]; then
      emit_finding "MEDIUM" "SECURITY_EXPOSURE" "${zone_name}(${zone_id})" \
        "Public hosted zone does not have DNSSEC signing enabled - DNS spoofing attacks are possible" \
        "ServeSignature=${dnssec_status}" \
        "Enable DNSSEC signing via aws route53 enable-hosted-zone-dnssec"
    fi
  done

echo "[*] Route 53 risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/route53-risk.sh
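
The script recommends exporting zone contents but does not check whether an export exists. A minimal sketch of that recovery artefact check follows, assuming exports are written as dated JSON files into a local directory. The dns-exports directory, the file naming scheme and both function names are hypothetical. Only export_zone needs AWS credentials.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for DNS recovery artefacts; the directory layout is an assumption.

EXPORT_DIR="${EXPORT_DIR:-dns-exports}"

# Write all record sets for a hosted zone to a dated JSON file.
export_zone() {
  local zone_id="$1"
  mkdir -p "$EXPORT_DIR"
  aws route53 list-resource-record-sets \
    --hosted-zone-id "$zone_id" \
    --output json > "${EXPORT_DIR}/${zone_id}-$(date -u +%Y%m%d).json"
}

# Succeed if at least one export for the zone was modified within max_age_days.
export_is_fresh() {
  local zone_id="$1" max_age_days="${2:-7}"
  [ -d "$EXPORT_DIR" ] || return 1
  [ -n "$(find "$EXPORT_DIR" -name "${zone_id}-*.json" -mtime "-${max_age_days}" -print 2>/dev/null)" ]
}
```

A detector loop could call export_is_fresh for each zone ID and emit a finding when it fails, which turns "missing DNS exports" into a measurable condition rather than a recommendation.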

7. KMS and Cryptographic Risk Detection

Cryptographic systems behave differently from almost every other platform component. Infrastructure can usually be rebuilt from code. Lost encryption keys often cannot be recovered by any means, and when they are lost the data they protected becomes permanently unreadable. The important question is never whether a key exists but whether data could become permanently unreadable.

The detector identifies keys scheduled for deletion, broad administrative permissions granted to non break glass roles, dependencies between backups and removable keys, and recovery processes that rely on a single identity.

cat > risk-service/kms-risk.sh << 'EOF'
#!/usr/bin/env bash
# kms-risk.sh - Material KMS and cryptographic risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/kms-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|kms|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== KMS Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

keys=$(aws kms list-keys --query 'Keys[*].KeyId' --output json | jq -r '.[]')

for key_id in $keys; do
  key_data=$(aws kms describe-key --key-id "$key_id" --output json 2>/dev/null)
  key_state=$(echo "$key_data" | jq -r '.KeyMetadata.KeyState')
  key_manager=$(echo "$key_data" | jq -r '.KeyMetadata.KeyManager')
  # Skip AWS managed keys - these are not customer controlled
  [ "$key_manager" = "AWS" ] && continue

  key_alias=$(aws kms list-aliases --key-id "$key_id" --query 'Aliases[0].AliasName' --output text 2>/dev/null || echo "no-alias")
  # --output text prints the literal string "None" when a key has no alias
  [ "$key_alias" = "None" ] && key_alias="no-alias"
  key_label="${key_alias}(${key_id})"

  # -----------------------------------------------------------------------
  # 1. Keys scheduled for deletion
  # -----------------------------------------------------------------------
  if [ "$key_state" = "PendingDeletion" ]; then
    deletion_date=$(echo "$key_data" | jq -r '.KeyMetadata.DeletionDate // "unknown"')
    emit_finding "CRITICAL" "CRYPTOGRAPHIC_LOSS" "$key_label" \
      "Customer managed key is scheduled for deletion on ${deletion_date} - all data encrypted by this key will become permanently unreadable" \
      "KeyState=PendingDeletion,DeletionDate=${deletion_date}" \
      "Cancel key deletion if any active data depends on this key: aws kms cancel-key-deletion --key-id ${key_id}"
  fi

  # -----------------------------------------------------------------------
  # 2. Keys with key rotation disabled
  # -----------------------------------------------------------------------
  if [ "$key_state" = "Enabled" ]; then
    rotation=$(aws kms get-key-rotation-status \
      --key-id "$key_id" \
      --query 'KeyRotationEnabled' \
      --output text 2>/dev/null || echo "false")

    if [ "$rotation" = "False" ]; then
      emit_finding "MEDIUM" "SECURITY_EXPOSURE" "$key_label" \
        "Customer managed key does not have automatic key rotation enabled" \
        "KeyRotationEnabled=false" \
        "Enable rotation: aws kms enable-key-rotation --key-id ${key_id}"
    fi
  fi

  # -----------------------------------------------------------------------
  # 3. Key policy grants kms:ScheduleKeyDeletion broadly
  # -----------------------------------------------------------------------
  policy=$(aws kms get-key-policy \
    --key-id "$key_id" \
    --policy-name default \
    --output text 2>/dev/null || echo "{}")

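  # ACCOUNT is a shell variable, not an exported one, so pass the root ARN in with --arg.
  # Action and Principal are normalised to arrays so string and list forms are treated alike.
  broad_delete=$(echo "$policy" | jq -r --arg root "arn:aws:iam::${ACCOUNT}:root" '
    [ .Statement[]? |
      select(.Effect == "Allow") |
      select([.Action] | flatten | any(. == "*" or . == "kms:*" or . == "kms:ScheduleKeyDeletion")) |
      (if (.Principal | type) == "object" then .Principal.AWS else .Principal end) |
      [.] | flatten | .[] | strings |
      select(. != $root)
    ] | unique | join(",")
  ' 2>/dev/null || true)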

  if [ -n "$broad_delete" ]; then
    emit_finding "HIGH" "CRYPTOGRAPHIC_LOSS" "$key_label" \
      "Key policy grants kms:ScheduleKeyDeletion to a principal other than the account root: ${broad_delete}" \
      "Principal=${broad_delete}" \
      "Review key policy and restrict ScheduleKeyDeletion to break glass roles only"
  fi

done

echo "[*] KMS risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/kms-risk.sh
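
The script above reports keys pending deletion but does not check the dependency between backups and removable keys. One way to sketch that check is to intersect two lists: the ARNs of keys in PendingDeletion, and the EncryptionKeyArn values reported for recovery points (for example from aws backup list-recovery-points-by-backup-vault). The function name and the one-ARN-per-line input files are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: find keys pending deletion that still protect recovery points.
# Inputs are plain text files with one KMS key ARN per line.

keys_blocking_restore() {
  local pending_file="$1" recovery_keys_file="$2"
  # -F fixed strings, -x whole-line match, -f read patterns from file; sort -u removes duplicates
  grep -Fx -f "$pending_file" "$recovery_keys_file" | sort -u
}
```

Any ARN printed is a key whose deletion would leave existing recovery points unreadable, so it is a reasonable candidate for a CRITICAL CRYPTOGRAPHIC_LOSS finding.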

8. Backup Risk Detection

Backups create confidence. Restore testing creates safety. The difference between the two is enormous because an untested backup is an assumption rather than a proven capability. Many organisations discover during an actual incident that their backup process was silently failing, that vault locks were never configured, or that the recovery path they believed existed had a dependency they had not anticipated.

The detector reports backup vaults without lock protection, recovery points that can be deleted, missing cross account recovery paths, and environments without recent restore evidence. The goal is not to confirm backups exist but to prove the organisation can survive mistakes.

cat > risk-service/backup-risk.sh << 'EOF'
#!/usr/bin/env bash
# backup-risk.sh - Material AWS Backup risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/backup-findings.txt"
DAYS_TO_CHECK=30
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|backup|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== Backup Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. Backup vaults without vault lock
# -----------------------------------------------------------------------
echo "[*] Checking backup vault lock status..."

aws backup list-backup-vaults \
  --query 'BackupVaultList[*].BackupVaultName' \
  --output json | jq -r '.[]' | while IFS= read -r vault; do
    lock_config=$(aws backup describe-backup-vault \
      --backup-vault-name "$vault" \
      --query '{Locked:Locked,MinRetention:MinRetentionDays,MaxRetention:MaxRetentionDays}' \
      --output json 2>/dev/null || echo '{"Locked":false}')

    locked=$(echo "$lock_config" | jq -r '.Locked // false')

    if [ "$locked" = "false" ]; then
      emit_finding "HIGH" "DATA_LOSS" "$vault" \
        "Backup vault does not have vault lock enabled - recovery points can be deleted by any principal with sufficient IAM permissions" \
        "Locked=false" \
        "Configure vault lock with a minimum retention period to prevent recovery point deletion"
    fi
  done

# -----------------------------------------------------------------------
# 2. Backup jobs that have failed in the past 30 days
# -----------------------------------------------------------------------
echo "[*] Checking for failed backup jobs in last ${DAYS_TO_CHECK} days..."

start_date=$(date -u -d "${DAYS_TO_CHECK} days ago" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
  || date -u -v-${DAYS_TO_CHECK}d '+%Y-%m-%dT%H:%M:%SZ')

failed_jobs=$(aws backup list-backup-jobs \
  --by-state FAILED \
  --by-created-after "$start_date" \
  --query 'BackupJobs[*].{Id:BackupJobId,Resource:ResourceArn,Vault:BackupVaultName,Created:CreationDate}' \
  --output json | jq 'length')

if [ "$failed_jobs" -gt 0 ]; then
  emit_finding "HIGH" "DATA_LOSS" "account-level" \
    "${failed_jobs} backup job(s) have failed in the last ${DAYS_TO_CHECK} days - affected resources may not have valid recovery points" \
    "FailedJobs=${failed_jobs},Period=${DAYS_TO_CHECK}days" \
    "Review failed backup jobs: aws backup list-backup-jobs --by-state FAILED --by-created-after ${start_date}"
fi

# -----------------------------------------------------------------------
# 3. Resources without a backup plan
# -----------------------------------------------------------------------
echo "[*] Checking for unprotected RDS instances..."

all_rds=$(aws rds describe-db-instances \
  --query 'DBInstances[*].DBInstanceArn' \
  --output json | jq -r '.[]')

for rds_arn in $all_rds; do
  recovery_points=$(aws backup list-recovery-points-by-resource \
    --resource-arn "$rds_arn" \
    --query 'RecoveryPoints | length(@)' \
    --output text 2>/dev/null || echo "0")

  if [ "$recovery_points" = "0" ]; then
    rds_id=$(echo "$rds_arn" | awk -F':' '{print $NF}')
    emit_finding "HIGH" "DATA_LOSS" "$rds_id" \
      "RDS instance has no recovery points in AWS Backup - no proven restore capability exists" \
      "RecoveryPoints=0,ARN=${rds_arn}" \
      "Create a backup plan that includes this resource, or verify native RDS automated backups are configured"
  fi
done

# -----------------------------------------------------------------------
# 4. No cross account copy rules (single account backup is not a recovery strategy)
# -----------------------------------------------------------------------
echo "[*] Checking for cross-account backup copy rules..."

# Count copy destinations whose vault ARN belongs to a different account (field 5 of the ARN)
plans_with_cross_account=$(aws backup list-backup-plans \
  --query 'BackupPlansList[*].BackupPlanId' \
  --output json | jq -r '.[]' | while IFS= read -r plan_id; do
    aws backup get-backup-plan \
      --backup-plan-id "$plan_id" \
      --query 'BackupPlan.Rules[*].CopyActions[*].DestinationBackupVaultArn' \
      --output json 2>/dev/null | jq -r '.[]? | .[]?' || true
  done | awk -F':' -v acct="$ACCOUNT" '$5 != "" && $5 != acct { n++ } END { print n + 0 }')

if [ "${plans_with_cross_account:-0}" = "0" ]; then
  emit_finding "HIGH" "DATA_LOSS" "account-level" \
    "No backup plans contain cross-account copy rules - all recovery points are in the same account and could be lost in an account compromise" \
    "CrossAccountCopyRules=0" \
    "Add cross-account copy actions to backup plans targeting a separate recovery account"
fi

echo "[*] Backup risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/backup-risk.sh
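
The scan above covers vault lock, failed jobs, coverage and copy rules, but not the recent restore evidence mentioned at the start of this section. A minimal sketch of that check follows, using aws backup list-restore-jobs. The 90 day window, the variable name and both function names are assumptions. A completed restore job is also weaker evidence than a validated restore, so treat this as a lower bound.

```shell
#!/usr/bin/env bash
# Hypothetical check: has any restore job completed within the window?

RESTORE_WINDOW_DAYS="${RESTORE_WINDOW_DAYS:-90}"  # assumed threshold, tune to your policy

# Count completed restore jobs created since the window start (requires AWS credentials).
count_recent_restores() {
  local since
  since=$(date -u -d "${RESTORE_WINDOW_DAYS} days ago" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null \
    || date -u -v-"${RESTORE_WINDOW_DAYS}"d '+%Y-%m-%dT%H:%M:%SZ')
  aws backup list-restore-jobs \
    --by-status COMPLETED \
    --by-created-after "$since" \
    --query 'length(RestoreJobs)' \
    --output json
}

# Turn the count into a finding line using the same pipe-delimited layout as the detectors.
restore_evidence_finding() {
  local count="$1" account="$2" region="$3"
  if [ "$count" -eq 0 ]; then
    echo "HIGH|DATA_LOSS|${account}|${region}|backup|account-level|No completed restore job in the last ${RESTORE_WINDOW_DAYS} days - restore capability is unproven|CompletedRestores=0,Period=${RESTORE_WINDOW_DAYS}days|Run and record a restore test for at least one critical resource"
  fi
}
```

Inside the detector this could be wired up as restore_evidence_finding "$(count_recent_restores)" "$ACCOUNT" "$REGION", with the output appended to the findings file.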

9. IAM Risk Detection

Identity is often the highest leverage detector in the entire platform because most outages begin with valid credentials performing valid operations. A developer running a cleanup script with an overpermissioned role, an automation pipeline with credentials that were never scoped, and a CI pipeline with the ability to modify its own IAM permissions together represent the most common failure pattern in cloud environments. Cyera’s 2025 enterprise AWS telemetry analysis found that IAM misconfiguration and plaintext secrets were the two most pervasive recurring risks across large AWS estates, and that scale itself creates blind spots because teams assume preventive controls will catch what runtime visibility does not. AWS’s guidance on using SCPs as permission guardrails emphasises that IAM alone is insufficient at scale and that organisation-level controls are required to prevent accounts from undermining their own boundaries.

The IAM detector identifies non break glass roles that can delete databases, remove backups, modify DNS, stop audit logging, destroy clusters, remove networks, schedule cryptographic deletion, or delete infrastructure stacks. The existence of the permission becomes the finding. There is no requirement for someone to admit the risk.

cat > risk-service/iam-risk.sh << 'EOF'
#!/usr/bin/env bash
# iam-risk.sh - Material IAM risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/iam-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|iam|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== IAM Material Risk Scan: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"

# Permissions that can cause material business harm if misused
DANGEROUS_ACTIONS=(
  "rds:DeleteDBInstance"
  "rds:DeleteDBCluster"
  "rds:ModifyDBInstance"
  "backup:DeleteBackupVault"
  "backup:DeleteRecoveryPoint"
  "route53:DeleteHostedZone"
  "route53:ChangeResourceRecordSets"
  "cloudtrail:StopLogging"
  "cloudtrail:DeleteTrail"
  "eks:DeleteCluster"
  "kms:ScheduleKeyDeletion"
  "kms:DisableKey"
  "ec2:DeleteVpc"
  "ec2:DeleteSubnet"
  "cloudformation:DeleteStack"
  "s3:DeleteBucket"
  "iam:AttachRolePolicy"
  "iam:PutRolePolicy"
  "iam:CreatePolicyVersion"
  "sts:AssumeRole"
)

# -----------------------------------------------------------------------
# 1. Roles with administrator access that are not explicitly break-glass
# -----------------------------------------------------------------------
echo "[*] Checking for non break-glass roles with AdministratorAccess..."

aws iam list-roles --query 'Roles[*].{RoleName:RoleName,Arn:Arn}' --output json | jq -c '.[]' | while IFS= read -r role; do
  role_name=$(echo "$role" | jq -r '.RoleName')
  role_arn=$(echo "$role" | jq -r '.Arn')

  # Skip service-linked roles and roles named break-glass / emergency
  if echo "$role_arn" | grep -qF ":role/aws-service-role/" \
    || echo "$role_name" | grep -qiE "break.?glass|emergency"; then
    continue
  fi

  attached_policies=$(aws iam list-attached-role-policies \
    --role-name "$role_name" \
    --query 'AttachedPolicies[*].PolicyArn' \
    --output json 2>/dev/null | jq -r '.[]')

  for policy_arn in $attached_policies; do
    if echo "$policy_arn" | grep -q "AdministratorAccess"; then
      # Check for trust policy - identify who can assume this role
      trust_policy=$(aws iam get-role \
        --role-name "$role_name" \
        --query 'Role.AssumeRolePolicyDocument' \
        --output json 2>/dev/null)

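      # Handles both object principals and the bare "*" form; a jq failure must not abort the scan
      principals=$(echo "$trust_policy" | jq -r '[.Statement[]?.Principal | if type == "object" then (.AWS, .Service, .Federated) else . end | select(. != null)] | flatten | join(",")' 2>/dev/null || true)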

      emit_finding "CRITICAL" "PRIVILEGE_ESCALATION" "$role_name" \
        "Non break-glass role has AdministratorAccess policy attached - any principal that can assume this role has full account control" \
        "PolicyArn=${policy_arn},TrustPrincipals=${principals}" \
        "Replace AdministratorAccess with a scoped policy or restrict assume role trust to break-glass procedures only"
    fi
  done
done

# -----------------------------------------------------------------------
# 2. IAM users with active access keys older than 90 days
# -----------------------------------------------------------------------
echo "[*] Checking IAM user access key age..."

aws iam list-users --query 'Users[*].UserName' --output json | jq -r '.[]' | while IFS= read -r user; do
  aws iam list-access-keys \
    --user-name "$user" \
    --query 'AccessKeyMetadata[*]' \
    --output json | jq -c '.[]' | while IFS= read -r key; do
      key_id=$(echo "$key" | jq -r '.AccessKeyId')
      key_status=$(echo "$key" | jq -r '.Status')
      created=$(echo "$key" | jq -r '.CreateDate')

      if [ "$key_status" = "Active" ]; then
        # Calculate age in days (cross-platform)
        created_epoch=$(date -d "$created" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S+00:00" "$created" +%s 2>/dev/null || echo "0")
        now_epoch=$(date +%s)
        age_days=$(( (now_epoch - created_epoch) / 86400 ))

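        # Skip keys whose creation date could not be parsed (created_epoch falls back to 0)
        if [ "$created_epoch" -gt 0 ] && [ "$age_days" -gt 90 ]; then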
          emit_finding "HIGH" "SECURITY_EXPOSURE" "${user}" \
            "IAM user has active access key ${key_id} that is ${age_days} days old - stale credentials increase the impact radius of a credential leak" \
            "KeyId=${key_id},AgeDays=${age_days},Status=Active" \
            "Rotate or deactivate access key: aws iam update-access-key --user-name ${user} --access-key-id ${key_id} --status Inactive"
        fi
      fi
  done
done

# -----------------------------------------------------------------------
# 3. Roles with inline policies granting dangerous actions with wildcard resources
# -----------------------------------------------------------------------
echo "[*] Checking inline role policies for dangerous wildcard permissions..."

aws iam list-roles --query 'Roles[*].RoleName' --output json | jq -r '.[]' | while IFS= read -r role_name; do
  inline_policies=$(aws iam list-role-policies \
    --role-name "$role_name" \
    --query 'PolicyNames' \
    --output json | jq -r '.[]')

  for policy_name in $inline_policies; do
    policy_doc=$(aws iam get-role-policy \
      --role-name "$role_name" \
      --policy-name "$policy_name" \
      --query 'PolicyDocument' \
      --output json 2>/dev/null)

    for action in "${DANGEROUS_ACTIONS[@]}"; do
      service=$(echo "$action" | cut -d':' -f1)

      # Normalise Action and Resource to arrays so string and list forms are handled alike.
      # A bare "*" action also grants every entry in DANGEROUS_ACTIONS.
      match=$(echo "$policy_doc" | jq -r --arg action "$action" --arg service "${service}:*" '
        .Statement[]? |
        select(.Effect == "Allow") |
        select([.Resource] | flatten | any(. == "*")) |
        select([.Action] | flatten | any(. == $action or . == $service or . == "*")) |
        "match"
      ' 2>/dev/null || true)

      if [ -n "$match" ]; then
        emit_finding "HIGH" "PRIVILEGE_ABUSE" "${role_name}/${policy_name}" \
          "Inline policy grants ${action} on Resource=* - this permission can cause material business harm" \
          "Action=${action},Resource=*,Policy=${policy_name}" \
          "Scope the resource constraint to specific ARNs or remove the permission if not required"
      fi
    done
  done
done

echo "[*] IAM risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/iam-risk.sh
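
The inline policy check above compares exact action names and service-wide wildcards, so a statement granting a partial wildcard such as rds:Delete* would not be reported. A minimal sketch of wildcard matching follows, relying on shell glob patterns to stand in for the IAM * wildcard and lower-casing both sides because IAM action names are case-insensitive. The function names action_matches and covered_dangerous_actions are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: does an IAM action pattern (with * wildcards) cover an action?

action_matches() {
  local pattern action
  # IAM action names are case-insensitive, so compare in lower case
  pattern=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  action=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  # An unquoted variable in a case pattern is treated as a glob
  case "$action" in
    $pattern) return 0 ;;
    *) return 1 ;;
  esac
}

# Print every listed action that the given pattern covers
covered_dangerous_actions() {
  local pattern="$1" action
  shift
  for action in "$@"; do
    if action_matches "$pattern" "$action"; then
      echo "$action"
    fi
  done
}
```

With the DANGEROUS_ACTIONS array from the script in scope, covered_dangerous_actions 'rds:Delete*' "${DANGEROUS_ACTIONS[@]}" would list the RDS deletion actions that the pattern grants.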

10. CloudTrail and Control Plane Visibility

Every recovery process depends on understanding what happened. Without audit evidence, an outage transforms from a manageable incident into guesswork, and the difference between a two hour recovery and a two day recovery often comes down entirely to whether logs exist. This area should remain small but non negotiable in every environment. AWS’s VPC Flow Logs documentation treats Flow Log absence as a Config rule violation, and the AWS Well-Architected Security Pillar requires network level logging as a foundational detective control, meaning an environment without these in place fails its own framework review before any external assessment begins.

The detector reports disabled CloudTrail, missing multi region logging, disabled configuration recording, and environments where audit evidence can disappear.

cat > risk-service/cloudtrail-risk.sh << 'EOF'
#!/usr/bin/env bash
# cloudtrail-risk.sh - Control plane visibility risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/cloudtrail-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|cloudtrail|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== CloudTrail Visibility Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. No active multi-region trail exists
# -----------------------------------------------------------------------
echo "[*] Checking for active multi-region CloudTrail..."

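# Boolean CLI options take no value: --no-include-shadow-trails limits the result
# to trails whose home region is this one, so run the scan in that region.
trails=$(aws cloudtrail describe-trails \
  --no-include-shadow-trails \
  --query 'trailList' \
  --output json)

# Count multi-region trails that are actually logging, not merely defined
multi_region_active=0
while IFS= read -r name; do
  status=$(aws cloudtrail get-trail-status --name "$name" \
    --query 'IsLogging' --output text 2>/dev/null || echo "False")
  if [ "$status" = "True" ]; then
    multi_region_active=$((multi_region_active + 1))
  fi
done < <(echo "$trails" | jq -r '.[] | select(.IsMultiRegionTrail == true) | .Name')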

if [ "$multi_region_active" = "0" ]; then
  emit_finding "CRITICAL" "REDUCED_VISIBILITY" "account-level" \
    "No active multi-region CloudTrail exists - API activity in some regions will not be logged" \
    "MultiRegionTrails=0" \
    "Create a multi-region trail that delivers to an S3 bucket with log file validation enabled"
fi

# -----------------------------------------------------------------------
# 2. Trail not logging, or log file validation disabled
# -----------------------------------------------------------------------
echo "$trails" | jq -c '.[]' | while IFS= read -r trail; do
  trail_name=$(echo "$trail" | jq -r '.Name')
  log_validation=$(echo "$trail" | jq -r '.LogFileValidationEnabled')
  # --output text renders booleans as True/False, so the fallback must use the
  # same casing for a failed status call to be treated as "not logging"
  is_logging=$(aws cloudtrail get-trail-status \
    --name "$trail_name" \
    --query 'IsLogging' \
    --output text 2>/dev/null || echo "False")

  if [ "$is_logging" = "False" ]; then
    emit_finding "CRITICAL" "REDUCED_VISIBILITY" "$trail_name" \
      "CloudTrail trail is not currently logging - API activity is not being recorded" \
      "IsLogging=false" \
      "Start logging: aws cloudtrail start-logging --name ${trail_name}"
  fi

  if [ "$log_validation" = "false" ]; then
    emit_finding "HIGH" "REDUCED_VISIBILITY" "$trail_name" \
      "CloudTrail log file validation is disabled - log tampering cannot be detected" \
      "LogFileValidationEnabled=false" \
      "Enable log file validation: aws cloudtrail update-trail --name ${trail_name} --enable-log-file-validation"
  fi
done

# -----------------------------------------------------------------------
# 3. AWS Config recorder disabled
# -----------------------------------------------------------------------
echo "[*] Checking AWS Config recorder status..."

config_recorder=$(aws configservice describe-configuration-recorder-status \
  --query 'ConfigurationRecordersStatus[0]' \
  --output json 2>/dev/null || echo '{}')

recording=$(echo "$config_recorder" | jq -r '.recording // false')

if [ "$recording" = "false" ]; then
  emit_finding "HIGH" "REDUCED_VISIBILITY" "account-level" \
    "AWS Config recorder is not running - resource configuration history is unavailable and compliance rules cannot evaluate" \
    "Recording=false" \
    "Start the Config recorder: aws configservice start-configuration-recorder --configuration-recorder-name default"
fi

# -----------------------------------------------------------------------
# 4. CloudTrail S3 bucket lacks MFA delete or versioning
# -----------------------------------------------------------------------
echo "[*] Checking CloudTrail S3 bucket protection..."

echo "$trails" | jq -r '.[].S3BucketName // empty' | sort -u | while IFS= read -r bucket; do
  versioning=$(aws s3api get-bucket-versioning \
    --bucket "$bucket" \
    --output json 2>/dev/null || echo '{}')

  versioning_status=$(echo "$versioning" | jq -r '.Status // "Disabled"')
  mfa_delete=$(echo "$versioning" | jq -r '.MFADelete // "Disabled"')

  if [ "$versioning_status" != "Enabled" ]; then
    emit_finding "HIGH" "REDUCED_VISIBILITY" "$bucket" \
      "CloudTrail log bucket does not have versioning enabled - logs can be overwritten or deleted without recovery" \
      "VersioningStatus=${versioning_status}" \
      "Enable versioning on the CloudTrail bucket to protect audit evidence"
  fi

  if [ "$mfa_delete" != "Enabled" ]; then
    emit_finding "MEDIUM" "REDUCED_VISIBILITY" "$bucket" \
      "CloudTrail log bucket does not have MFA delete enabled - logs can be permanently deleted with only IAM credentials" \
      "MFADelete=${mfa_delete}" \
      "Enable MFA delete on the CloudTrail bucket for additional protection of audit evidence"
  fi
done

echo "[*] CloudTrail visibility scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/cloudtrail-risk.sh
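
Every detector writes its results through `emit_finding` as nine pipe-delimited fields, with the severity first, after a single `===` header line. A minimal sketch of how those files could be rolled up into a per-severity count follows; `summarise_findings` is a hypothetical helper, and it assumes that no field contains a literal `|` and that only the three severities used in this section appear.

```bash
# summarise_findings FILE...
# Counts finding records per severity across one or more findings files.
# Header and progress lines are skipped because they do not have nine fields.
summarise_findings() {
  awk -F'|' '
    NF == 9 { count[$1]++ }
    END {
      n = split("CRITICAL HIGH MEDIUM", order, " ")
      for (i = 1; i <= n; i++)
        printf "%s=%d\n", order[i], count[order[i]]
    }
  ' "$@"
}

# Example (run after the detectors have produced output):
#   summarise_findings findings/*-findings.txt
```

Keeping the roll-up this small is deliberate: the point of the exercise is a short list of CRITICAL and HIGH items that someone will act on, not another dashboard.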

11. VPC and Network Risk Detection

Network configuration sits beneath every other service in the stack, which means a gap here does not produce one finding but potentially amplifies every other risk category. The two most persistent network risks in AWS environments are VPC Flow Logs that are not enabled and default VPCs that remain in use. AWS’s own Config managed rule treats Flow Log absence as a compliance failure, and the AWS Well-Architected Framework requires network-level logging as a foundational detective control. Default VPCs arrive with broadly permissive configurations and are rarely reviewed; production workloads placed in them inherit that posture without any deliberate security decision having been made.

The detector checks that every VPC in the region has at least one active Flow Log, that no running instances sit inside default VPCs, and that non-main route tables do not send 0.0.0.0/0 straight to an internet gateway where the subnets behind them are meant to be private. Broad Network ACL rules that permit all traffic are also reported. NACLs and security groups are evaluated independently, so a permissive NACL does not override a security group, but it does remove the network-layer backstop and leaves security groups as the only remaining control. Tenable’s 2025 Cloud Security Risk Report identified network visibility gaps as one of the top four recurring misconfiguration categories across enterprise cloud environments.

cat > risk-service/vpc-risk.sh << 'EOF'
#!/usr/bin/env bash
# vpc-risk.sh - Material VPC and network risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/vpc-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|vpc|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== VPC Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. VPCs without Flow Logs enabled
# -----------------------------------------------------------------------
echo "[*] Checking VPCs for Flow Log coverage..."

aws ec2 describe-vpcs \
  --query 'Vpcs[*].{VpcId:VpcId,IsDefault:IsDefault,Tags:Tags}' \
  --output json | jq -c '.[]' | while IFS= read -r vpc; do
    vpc_id=$(echo "$vpc" | jq -r '.VpcId')
    is_default=$(echo "$vpc" | jq -r '.IsDefault')

    flow_log_count=$(aws ec2 describe-flow-logs \
      --filter "Name=resource-id,Values=${vpc_id}" \
      --query 'FlowLogs[?FlowLogStatus==`ACTIVE`] | length(@)' \
      --output text 2>/dev/null || echo "0")

    if [ "${flow_log_count:-0}" = "0" ]; then
      severity="HIGH"
      [ "$is_default" = "true" ] && severity="MEDIUM"
      emit_finding "$severity" "REDUCED_VISIBILITY" "$vpc_id" \
        "VPC has no active Flow Logs - network traffic analysis and incident investigation are impossible" \
        "FlowLogs=0,IsDefault=${is_default}" \
        "Enable VPC Flow Logs delivering to CloudWatch Logs or S3, for example: aws ec2 create-flow-logs --resource-type VPC --resource-ids ${vpc_id} --traffic-type ALL --log-destination-type s3 --log-destination <BUCKET_ARN>"
    fi
  done

# -----------------------------------------------------------------------
# 2. Default VPCs with running instances
# -----------------------------------------------------------------------
echo "[*] Checking default VPCs for active workloads..."

default_vpcs=$(aws ec2 describe-vpcs \
  --filters "Name=isDefault,Values=true" \
  --query 'Vpcs[*].VpcId' \
  --output json | jq -r '.[]')

for vpc_id in $default_vpcs; do
  instance_count=$(aws ec2 describe-instances \
    --filters "Name=vpc-id,Values=${vpc_id}" "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].InstanceId' \
    --output json | jq 'flatten | length')

  if [ "${instance_count:-0}" -gt 0 ] 2>/dev/null; then
    emit_finding "HIGH" "SECURITY_EXPOSURE" "$vpc_id" \
      "Default VPC has ${instance_count} running instance(s) - default VPCs carry permissive configurations that were never deliberately reviewed for production use" \
      "IsDefault=true,RunningInstances=${instance_count}" \
      "Migrate workloads to a purpose-built VPC with explicit subnet, routing, and security group design"
  fi
done

# -----------------------------------------------------------------------
# 3. Non-main route tables with a default route straight to an internet gateway
# -----------------------------------------------------------------------
echo "[*] Checking for unnecessary internet gateway attachments..."

aws ec2 describe-internet-gateways \
  --query 'InternetGateways[?Attachments[?State==`available`]]' \
  --output json | jq -c '.[]' | while IFS= read -r igw; do
    igw_id=$(echo "$igw" | jq -r '.InternetGatewayId')
    vpc_id=$(echo "$igw" | jq -r '.Attachments[0].VpcId // empty')

    [ -z "$vpc_id" ] && continue

    # Count non-main route tables that send 0.0.0.0/0 to this IGW. The filtering is
    # done in jq because a JMESPath length() over a projection counts route tables,
    # not matching routes, and would flag every VPC that has a non-main route table.
    private_subnets_with_igw=$(aws ec2 describe-route-tables \
      --filters "Name=vpc-id,Values=${vpc_id}" \
      --output json 2>/dev/null | jq --arg igw "$igw_id" '
        [.RouteTables[]
          | select(any(.Associations[]?; .Main == true) | not)
          | select(any(.Routes[]?; .GatewayId == $igw and .DestinationCidrBlock == "0.0.0.0/0"))
        ] | length' || echo "0")

    if [ "${private_subnets_with_igw:-0}" -gt 0 ] 2>/dev/null; then
      emit_finding "HIGH" "SECURITY_EXPOSURE" "${vpc_id}/${igw_id}" \
        "Non-main route table(s) in VPC route 0.0.0.0/0 directly to internet gateway - every subnet associated with them is publicly routable, which is an exposure wherever that was not intended" \
        "IGW=${igw_id},PrivateSubnetRoutes=${private_subnets_with_igw}" \
        "Review route tables and remove direct IGW routes from subnets that should be private"
    fi
  done

# -----------------------------------------------------------------------
# 4. Network ACLs with allow-all rules on sensitive ports
# -----------------------------------------------------------------------
echo "[*] Checking Network ACLs for overly permissive inbound rules..."

SENSITIVE_PORTS=(22 3389 1433 3306 5432 6379 27017)

aws ec2 describe-network-acls \
  --query 'NetworkAcls[*]' \
  --output json | jq -c '.[]' | while IFS= read -r nacl; do
    nacl_id=$(echo "$nacl" | jq -r '.NetworkAclId')
    is_default=$(echo "$nacl" | jq -r '.IsDefault')
    vpc_id=$(echo "$nacl" | jq -r '.VpcId')

    echo "$nacl" | jq -c '.Entries[] | select(.Egress == false and .RuleAction == "allow")' | while IFS= read -r entry; do
      cidr=$(echo "$entry" | jq -r '.CidrBlock // empty')
      from_port=$(echo "$entry" | jq -r '.PortRange.From // 0')
      to_port=$(echo "$entry" | jq -r '.PortRange.To // 65535')
      protocol=$(echo "$entry" | jq -r '.Protocol')

      # Protocol -1 means all traffic
      if [ "$protocol" = "-1" ] && [ "$cidr" = "0.0.0.0/0" ]; then
        emit_finding "HIGH" "SECURITY_EXPOSURE" "${nacl_id}(${vpc_id})" \
          "Network ACL has allow-all inbound rule for 0.0.0.0/0 on all protocols - security groups are the only remaining control layer" \
          "Protocol=all,CIDR=0.0.0.0/0,IsDefault=${is_default}" \
          "Add explicit deny rules for sensitive administrative ports in the NACL before the allow-all rule"
        break
      fi

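      # Port ranges only apply to TCP (6) and UDP (17); skip ICMP and other protocols
      if [ "$protocol" != "6" ] && [ "$protocol" != "17" ]; then
        continue
      fi

      for port in "${SENSITIVE_PORTS[@]}"; do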
        if [ "$from_port" -le "$port" ] && [ "$to_port" -ge "$port" ] && [ "$cidr" = "0.0.0.0/0" ] 2>/dev/null; then
          emit_finding "MEDIUM" "SECURITY_EXPOSURE" "${nacl_id}(${vpc_id})" \
            "Network ACL permits inbound 0.0.0.0/0 access to sensitive port ${port}" \
            "Port=${port},CIDR=0.0.0.0/0,Protocol=${protocol}" \
            "Add a NACL deny rule for port ${port} from 0.0.0.0/0 to enforce network-layer defence in depth"
          break
        fi
      done
    done
  done

echo "[*] VPC risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/vpc-risk.sh
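
Network ACL entries are evaluated in ascending rule number order and the first matching entry decides the outcome, so an allow rule only matters if no lower-numbered deny rule already covers the same port. The NACL check above reports any matching allow entry without looking at that ordering. The sketch below shows one way first-match evaluation could be layered on top; `nacl_verdict` and its four-column input format are assumptions made for illustration, not something the detector produces today.

```bash
# nacl_verdict PORT
# Reads inbound entries for a single CIDR and protocol on stdin, one per line:
#   RULE_NUMBER ACTION FROM_PORT TO_PORT
# Prints the action of the lowest-numbered entry covering PORT, or "deny" when
# nothing matches (the implicit final rule of every NACL).
nacl_verdict() {
  local port="$1"
  sort -n | awk -v port="$port" '
    $3 <= port && port <= $4 { print $2; found = 1; exit }
    END { if (!found) print "deny" }
  '
}

# Example: a deny on 22 ahead of a broad allow means SSH is blocked
printf '200 allow 0 65535\n100 deny 22 22\n' | nacl_verdict 22
```

The input rows could plausibly be produced from each `describe-network-acls` entry with a jq expression that emits `RuleNumber`, `RuleAction`, and the port range as tab-separated values, reporting a port only when the verdict is `allow`.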

12. Secrets Manager and Credential Risk Detection

Secrets are the bridge between identity and access. A rotation-disabled database credential, a secret that has not been accessed in months but remains active, or a plaintext password stored in SSM Parameter Store as a standard String rather than a SecureString represents the same class of risk as an overly permissive IAM role, except that it is far less likely to appear in any access review. The 2025 Tenable Cloud Security Risk Report found that plaintext secrets and incomplete credential lifecycle management were among the most pervasive gaps in enterprise AWS environments. The AWS Security Blog’s guidance on secret rotation is explicit that credentials not rotating on a schedule should be treated as compromised at an unknown point in the past.

The detector checks for Secrets Manager secrets with rotation disabled, secrets that have not been rotated within the configured window, secrets unused for more than 90 days that remain active, and SSM Parameter Store parameters whose names suggest credentials but are stored as unencrypted String types rather than SecureString.

cat > risk-service/secrets-risk.sh << 'EOF'
#!/usr/bin/env bash
# secrets-risk.sh - Material Secrets Manager and credential risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/secrets-findings.txt"
STALE_DAYS=90
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|secrets|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== Secrets Manager Material Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

now_epoch=$(date +%s)

# -----------------------------------------------------------------------
# 1. Secrets with rotation disabled
# -----------------------------------------------------------------------
echo "[*] Checking Secrets Manager rotation status..."

aws secretsmanager list-secrets \
  --query 'SecretList[*]' \
  --output json | jq -c '.[]' | while IFS= read -r secret; do
    secret_name=$(echo "$secret" | jq -r '.Name')
    rotation_enabled=$(echo "$secret" | jq -r '.RotationEnabled // false')
    last_changed=$(echo "$secret" | jq -r '.LastChangedDate // empty')

    # Skip secrets that appear to be non-credential config
    name_lower=$(echo "$secret_name" | tr '[:upper:]' '[:lower:]')
    is_credential=false
    for keyword in password passwd credential key token api db database secret; do
      if echo "$name_lower" | grep -q "$keyword"; then
        is_credential=true
        break
      fi
    done

    if [ "$rotation_enabled" = "false" ] && [ "$is_credential" = "true" ]; then
      emit_finding "HIGH" "SECURITY_EXPOSURE" "$secret_name" \
        "Secret with credential naming pattern does not have automatic rotation enabled - the credential may have been compromised at an unknown point in the past without any indication" \
        "RotationEnabled=false" \
        "Enable rotation: configure a rotation Lambda or use Secrets Manager managed rotation for supported database types"
    fi

    # -----------------------------------------------------------------------
    # 2. Secrets not accessed for more than STALE_DAYS days
    # -----------------------------------------------------------------------
    last_accessed=$(echo "$secret" | jq -r '.LastAccessedDate // empty')
    if [ -n "$last_accessed" ]; then
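      accessed_epoch=$(date -d "$last_accessed" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S+00:00" "$last_accessed" +%s 2>/dev/null || echo "0")
      # An epoch of 0 means the timestamp could not be parsed; report nothing rather than a bogus age
      age_days=0
      if [ "$accessed_epoch" -gt 0 ] 2>/dev/null; then
        age_days=$(( (now_epoch - accessed_epoch) / 86400 ))
      fi

      if [ "$age_days" -gt "$STALE_DAYS" ] 2>/dev/null; then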
        emit_finding "MEDIUM" "SECURITY_EXPOSURE" "$secret_name" \
          "Secret has not been accessed in ${age_days} days but remains active - stale active credentials widen blast radius unnecessarily" \
          "LastAccessed=${last_accessed},AgeDays=${age_days}" \
          "Review whether the secret is still required; if not, disable or delete it to reduce credential exposure surface"
      fi
    fi

    # -----------------------------------------------------------------------
    # 3. Secrets overdue for rotation despite rotation being enabled
    # -----------------------------------------------------------------------
    if [ "$rotation_enabled" = "true" ]; then
      rotation_days=$(echo "$secret" | jq -r '.RotationRules.AutomaticallyAfterDays // 0')
      last_rotated=$(echo "$secret" | jq -r '.LastRotatedDate // empty')

      if [ -n "$last_rotated" ] && [ "$rotation_days" -gt 0 ] 2>/dev/null; then
        rotated_epoch=$(date -d "$last_rotated" +%s 2>/dev/null || date -j -f "%Y-%m-%dT%H:%M:%S+00:00" "$last_rotated" +%s 2>/dev/null || echo "0")
        # Skip the check when the timestamp could not be parsed (epoch 0)
        overdue=0
        if [ "$rotated_epoch" -gt 0 ] 2>/dev/null; then
          days_since_rotation=$(( (now_epoch - rotated_epoch) / 86400 ))
          overdue=$(( days_since_rotation - rotation_days ))
        fi

        if [ "$overdue" -gt 7 ] 2>/dev/null; then
          emit_finding "HIGH" "SECURITY_EXPOSURE" "$secret_name" \
            "Secret rotation is enabled with a ${rotation_days}-day schedule but the last rotation was ${days_since_rotation} days ago - rotation has silently failed" \
            "RotationDays=${rotation_days},DaysSinceRotation=${days_since_rotation}" \
            "Investigate rotation Lambda function health; check CloudWatch logs for rotation failures on this secret"
        fi
      fi
    fi
  done

# -----------------------------------------------------------------------
# 4. SSM Parameter Store parameters with credential names stored as plaintext
# -----------------------------------------------------------------------
echo "[*] Checking SSM Parameter Store for plaintext credential parameters..."

CREDENTIAL_KEYWORDS=("password" "passwd" "secret" "credential" "apikey" "api_key" "token" "private_key" "db_pass")

for keyword in "${CREDENTIAL_KEYWORDS[@]}"; do
  aws ssm describe-parameters \
    --parameter-filters "Key=Name,Option=Contains,Values=${keyword}" \
    --query 'Parameters[?Type!=`SecureString`].{Name:Name,Type:Type}' \
    --output json 2>/dev/null | jq -c '.[]' | while IFS= read -r param; do
      param_name=$(echo "$param" | jq -r '.Name')
      param_type=$(echo "$param" | jq -r '.Type')

      emit_finding "HIGH" "SECURITY_EXPOSURE" "$param_name" \
        "SSM Parameter Store parameter with credential naming pattern is stored as ${param_type} rather than SecureString - the value is not KMS-encrypted and is returned in clear text to any principal allowed to read the parameter" \
        "Name=${param_name},Type=${param_type},Keyword=${keyword}" \
        "Migrate the parameter to SecureString type encrypted with a customer managed KMS key"
    done
done

echo "[*] Secrets Manager risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/secrets-risk.sh
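
The rotation check decides whether a secret looks like a credential by substring matching, and short keywords such as `key`, `api`, and `db` also match unrelated names (`feedback` contains `db`, `monkey` contains `key`). A minimal sketch of a stricter, token-based match follows. `looks_like_credential` is a hypothetical helper that reuses the keyword list from the script; it assumes names are separated with `/`, `-`, `_`, or `.`, so camelCase names such as `dbPassword` would need extra handling.

```bash
# looks_like_credential NAME
# Returns 0 when any token of NAME (split on non-alphanumeric characters and
# lower-cased) equals one of the credential keywords exactly.
looks_like_credential() {
  local tokens token keyword
  tokens=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9' ' ')
  for token in $tokens; do
    for keyword in password passwd credential key token api db database secret; do
      if [ "$token" = "$keyword" ]; then
        return 0
      fi
    done
  done
  return 1
}

# Example: prod/db_password splits into "prod db password" and matches on "db"
if looks_like_credential 'prod/db_password'; then
  echo "treat as credential"
fi
```

The trade-off is fewer false positives at the cost of missing names that run keywords together, which may or may not suit a given naming convention.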

13. GuardDuty Threat Detection Coverage

GuardDuty is the primary AWS-native mechanism for detecting active threats in real time. It analyses CloudTrail management events, VPC Flow Logs, and DNS query logs to identify credential compromise, lateral movement, cryptomining, data exfiltration, and unusual API behaviour that no static configuration check would ever surface. The AWS Security Services Best Practices guide states clearly that GuardDuty must be enabled in every region and every account to provide meaningful coverage, because a single disabled region or unmanaged account represents a gap an attacker can exploit to operate undetected. Disabling GuardDuty, or running it without the S3 Protection or EKS Protection plans in environments that use those services, means that entire attack surfaces produce no findings regardless of what activity occurs within them.

The detector checks whether GuardDuty is enabled in the current region, whether the key protection plans are active, how often findings are published, and whether high severity findings have been left unactioned for more than a week. A finding that nobody reads is not a control.

cat > risk-service/guardduty-risk.sh << 'EOF'
#!/usr/bin/env bash
# guardduty-risk.sh - Material GuardDuty coverage risk detection
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/guardduty-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|guardduty|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== GuardDuty Coverage Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. GuardDuty not enabled in this region
# -----------------------------------------------------------------------
echo "[*] Checking GuardDuty detector status..."

detectors=$(aws guardduty list-detectors \
  --query 'DetectorIds' \
  --output json 2>/dev/null | jq -r '.[]' || echo "")

if [ -z "$detectors" ]; then
  emit_finding "CRITICAL" "REDUCED_VISIBILITY" "account-level" \
    "GuardDuty is not enabled in region ${REGION} - active threats including credential compromise and data exfiltration will not be detected" \
    "DetectorIds=none" \
    "Enable GuardDuty: aws guardduty create-detector --enable --finding-publishing-frequency FIFTEEN_MINUTES"
else
  for detector_id in $detectors; do
    detector_data=$(aws guardduty get-detector \
      --detector-id "$detector_id" \
      --output json 2>/dev/null)

    status=$(echo "$detector_data" | jq -r '.Status')

    if [ "$status" != "ENABLED" ]; then
      emit_finding "CRITICAL" "REDUCED_VISIBILITY" "$detector_id" \
        "GuardDuty detector exists but is not in ENABLED state - threat detection is inactive" \
        "Status=${status}" \
        "Re-enable the detector: aws guardduty update-detector --detector-id ${detector_id} --enable"
    fi

    # -----------------------------------------------------------------------
    # 2. Key protection plans not enabled
    # -----------------------------------------------------------------------
    s3_protection=$(echo "$detector_data" | jq -r '.DataSources.S3Logs.Status // "DISABLED"')
    # Features[] may be absent or lack the entry, so collect matches and default explicitly
    eks_audit=$(echo "$detector_data" | jq -r '[.Features[]? | select(.Name == "EKS_AUDIT_LOGS") | .Status][0] // "DISABLED"')

    if [ "$s3_protection" != "ENABLED" ]; then
      emit_finding "HIGH" "REDUCED_VISIBILITY" "$detector_id" \
        "GuardDuty S3 Protection is not enabled - data exfiltration and anomalous S3 API activity will not be detected" \
        "S3Protection=${s3_protection}" \
        "Enable S3 Protection in the GuardDuty console or via: aws guardduty update-detector --detector-id ${detector_id} --data-sources S3Logs={Enable=true}"
    fi

    # Check EKS coverage only if EKS clusters exist
    cluster_count=$(aws eks list-clusters --query 'clusters | length(@)' --output text 2>/dev/null || echo "0")
    if [ "${cluster_count:-0}" -gt 0 ] && [ "$eks_audit" != "ENABLED" ]; then
      emit_finding "HIGH" "REDUCED_VISIBILITY" "$detector_id" \
        "GuardDuty EKS Audit Log monitoring is not enabled despite ${cluster_count} EKS cluster(s) existing - malicious Kubernetes API activity will not be detected" \
        "EKSAuditLogs=${eks_audit},ClusterCount=${cluster_count}" \
        "Enable EKS protection in the GuardDuty console"
    fi

    # -----------------------------------------------------------------------
    # 3. Finding publishing frequency too low
    # -----------------------------------------------------------------------
    publish_freq=$(echo "$detector_data" | jq -r '.FindingPublishingFrequency // "SIX_HOURS"')

    if [ "$publish_freq" = "SIX_HOURS" ]; then
      emit_finding "MEDIUM" "REDUCED_VISIBILITY" "$detector_id" \
        "GuardDuty finding publishing frequency is set to SIX_HOURS - updates to existing findings can take up to six hours to reach EventBridge and any downstream alerting" \
        "FindingPublishingFrequency=${publish_freq}" \
        "Set frequency to FIFTEEN_MINUTES: aws guardduty update-detector --detector-id ${detector_id} --finding-publishing-frequency FIFTEEN_MINUTES"
    fi

    # -----------------------------------------------------------------------
    # 4. High severity findings unacknowledged for more than 7 days
    # -----------------------------------------------------------------------
    echo "[*] Checking for unacknowledged high severity GuardDuty findings..."

    # GuardDuty filters updatedAt as Unix epoch milliseconds, not an ISO 8601 string
    seven_days_ago_ms=$(( ($(date +%s) - 7 * 86400) * 1000 ))

    # JSON output is used so that length() is applied to the full paginated result
    unacknowledged=$(aws guardduty list-findings \
      --detector-id "$detector_id" \
      --finding-criteria '{
        "Criterion": {
          "severity": {"GreaterThanOrEqual": 7},
          "service.archived": {"Equals": ["false"]},
          "updatedAt": {"LessThan": '"${seven_days_ago_ms}"'}
        }
      }' \
      --query 'FindingIds | length(@)' \
      --output json 2>/dev/null || echo "0")

    if [ "${unacknowledged:-0}" -gt 0 ] 2>/dev/null; then
      emit_finding "HIGH" "REDUCED_VISIBILITY" "$detector_id" \
        "${unacknowledged} high severity GuardDuty finding(s) have been active for more than 7 days without being archived or remediated" \
        "UnacknowledgedHighFindings=${unacknowledged},OlderThan=7days" \
        "Review and remediate or archive these findings in the GuardDuty console"
    fi

  done
fi

echo "[*] GuardDuty risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/guardduty-risk.sh
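
The detector inspects only the region held in `AWS_DEFAULT_REGION`, while the coverage argument above is about every region. A minimal sketch of a wrapper that repeats a detector per region is shown below. `scan_regions` is a hypothetical helper; it assumes the detector truncates its findings file on each run (as the scripts in this article do with `tee`), so it keeps a per-region copy, and it assumes `AWS_REGION` is not also set in the environment, since the CLI may prefer that variable.

```bash
# scan_regions DETECTOR FINDINGS_FILE REGION...
# Runs DETECTOR once per region and keeps a per-region copy of its findings file,
# e.g. findings/guardduty-findings-eu-west-1.txt
scan_regions() {
  local detector="$1" findings_file="$2" region
  shift 2
  for region in "$@"; do
    echo "[*] ${detector} region=${region}" >&2
    if AWS_DEFAULT_REGION="$region" bash "$detector" >/dev/null; then
      cp "$findings_file" "${findings_file%.txt}-${region}.txt"
    else
      echo "[!] ${detector} failed in ${region}" >&2
    fi
  done
}

# Example (needs AWS credentials; describe-regions lists the regions enabled for the account):
#   regions=$(aws ec2 describe-regions --query 'Regions[].RegionName' --output text)
#   scan_regions risk-service/guardduty-risk.sh findings/guardduty-findings.txt $regions
```

A failure in one region is logged and skipped rather than aborting the loop, so a region that rejects the call does not hide results from the others.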

14. AWS Organisations and Account Governance

A single account’s security posture is meaningless in isolation if the organisation that contains it has no guardrails at the boundary. Service Control Policies are the mechanism by which an organisation prevents individual accounts from removing their own controls. Without them, an attacker with administrative access to one account can disable GuardDuty, stop CloudTrail, remove backup vaults, and drain S3 buckets entirely, and no IAM policy within that account can stop it, because the same administrator can rewrite that policy. AWS’s September 2025 update to SCP capabilities extended full IAM policy language support to SCPs, making far more precise guardrails possible than before, and the Towards The Cloud SCP reference documents the foundational policies that every production organisation should have in place. The absence of even basic SCPs means that account-level IAM is the only control layer, and a single privilege escalation removes it entirely.

As of June 2025, AWS mandates root MFA for all member accounts. This script verifies that the organisation has custom SCPs in place to protect security tooling, that GuardDuty delegation is configured, that member account root credentials are properly governed, and that the management account is not being used as a workload account. SCPs never apply to the management account, so anything running there sits outside every guardrail.
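
For orientation before the detector itself, here is a minimal sketch of the kind of guardrail it looks for: a single deny statement covering the four actions the detector's own recommendation names. The file name, `Sid`, and policy name are placeholders chosen for illustration, and a real policy would usually carve out an exception for a break-glass role.

```bash
# Write an example SCP that denies the actions used to switch off security tooling
cat > scp-protect-security-tooling.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProtectSecurityTooling",
      "Effect": "Deny",
      "Action": [
        "cloudtrail:StopLogging",
        "guardduty:DeleteDetector",
        "backup:DeleteBackupVault",
        "kms:ScheduleKeyDeletion"
      ],
      "Resource": "*"
    }
  ]
}
EOF

# To apply it (from the management account; IDs below are placeholders):
#   aws organizations create-policy --name protect-security-tooling \
#     --description "Deny removal of security tooling" \
#     --type SERVICE_CONTROL_POLICY --content file://scp-protect-security-tooling.json
#   aws organizations attach-policy --policy-id <POLICY_ID> --target-id <ROOT_OR_OU_ID>
```

Creating the policy is not enough on its own: it has no effect until it is attached to the root, an OU, or an account.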

cat > risk-service/organisations-risk.sh << 'EOF'
#!/usr/bin/env bash
# organisations-risk.sh - AWS Organisations and account governance risk detection
# Requires: aws cli v2, jq, organizations:DescribeOrganization and organizations:ListPolicies permissions
# Note: This script must run from a principal with AWS Organizations read access

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/organisations-findings.txt"
mkdir -p findings

emit_finding() {
  local severity="$1" risk_type="$2" resource="$3" finding="$4" evidence="$5" recommendation="$6"
  echo "${severity}|${risk_type}|${ACCOUNT}|${REGION}|organisations|${resource}|${finding}|${evidence}|${recommendation}" \
    | tee -a "$OUTPUT_FILE"
}

echo "=== Organisations Governance Risk Scan: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"

# -----------------------------------------------------------------------
# 1. Check if this account is part of an AWS Organisation
# -----------------------------------------------------------------------
org_info=$(aws organizations describe-organization --output json 2>/dev/null || echo '{}')

if [ "$(echo "$org_info" | jq 'has("Organization")')" != "true" ]; then
  emit_finding "HIGH" "PRIVILEGE_ESCALATION" "account-level" \
    "Account is not part of an AWS Organization - no SCP guardrails exist and account level IAM is the only control boundary" \
    "OrganizationMember=false" \
    "Enrol the account in an AWS Organization and apply SCPs to prevent control plane manipulation"
  echo "[*] Account is not in an organisation. Skipping further organisation checks."
  exit 0
fi

master_account=$(echo "$org_info" | jq -r '.Organization.MasterAccountId')
is_master=$([ "$ACCOUNT" = "$master_account" ] && echo "true" || echo "false")

# -----------------------------------------------------------------------
# 2. Management account being used as a workload account
# -----------------------------------------------------------------------
if [ "$is_master" = "true" ]; then
  running_instances=$(aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].InstanceId' \
    --output json | jq 'flatten | length')

  if [ "${running_instances:-0}" -gt 0 ] 2>/dev/null; then
    emit_finding "CRITICAL" "PRIVILEGE_ESCALATION" "management-account" \
      "The AWS Organizations management account has ${running_instances} running EC2 instances - workloads in the management account are exempt from all SCPs and represent an unguarded escalation path" \
      "ManagementAccount=true,RunningInstances=${running_instances}" \
      "Migrate all workloads out of the management account; use it only for billing and organisation governance"
  fi
fi

# -----------------------------------------------------------------------
# 3. No SCPs attached to the root or any OU
# -----------------------------------------------------------------------
echo "[*] Checking for SCP coverage at root level..."

scp_policies=$(aws organizations list-policies \
  --filter SERVICE_CONTROL_POLICY \
  --query 'Policies[?Id!=`p-FullAWSAccess`] | length(@)' \
  --output text 2>/dev/null || echo "0")

if [ "${scp_policies:-0}" = "0" ]; then
  emit_finding "CRITICAL" "PRIVILEGE_ESCALATION" "organisation-level" \
    "No custom Service Control Policies exist beyond the default FullAWSAccess - any account can disable monitoring, delete backups, and destroy infrastructure without organisational guardrails" \
    "CustomSCPs=0" \
    "Create SCPs that at minimum deny: cloudtrail:StopLogging, guardduty:DeleteDetector, backup:DeleteBackupVault, kms:ScheduleKeyDeletion across all member accounts"
fi

# -----------------------------------------------------------------------
# 4. GuardDuty not delegated to a security account
# -----------------------------------------------------------------------
echo "[*] Checking GuardDuty organisational delegation..."

guardduty_delegated=$(aws guardduty list-organization-admin-accounts \
  --query 'AdminAccounts | length(@)' \
  --output text 2>/dev/null || echo "0")

if [ "${guardduty_delegated:-0}" = "0" ]; then
  emit_finding "HIGH" "REDUCED_VISIBILITY" "organisation-level" \
    "GuardDuty has no delegated administrator account - findings from member accounts are not centralised and individual accounts can disable their own threat detection" \
    "GuardDutyDelegatedAdmin=none" \
    "Delegate GuardDuty administration to a dedicated security account: aws guardduty enable-organization-admin-account --admin-account-id <SECURITY_ACCOUNT_ID>"
fi

# -----------------------------------------------------------------------
# 5. CloudTrail not configured at organisation level
# -----------------------------------------------------------------------
echo "[*] Checking for organisation-level CloudTrail..."

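# Shadow trails are included by default, so an organisation trail whose home
# region differs from the current region is still counted.
org_trails=$(aws cloudtrail describe-trails \
  --query 'trailList[?IsOrganizationTrail==`true`] | length(@)' \
  --output text 2>/dev/null || echo "0")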

if [ "${org_trails:-0}" = "0" ]; then
  emit_finding "HIGH" "REDUCED_VISIBILITY" "organisation-level" \
    "No organisation-level CloudTrail exists - member accounts manage their own audit logging and can stop or delete it without affecting other accounts" \
    "OrgTrails=0" \
    "Create an organisation trail from the management account to centralise all member account API activity into a protected central bucket"
fi

# -----------------------------------------------------------------------
# 6. IAM Access Analyser not enabled at organisation level
# -----------------------------------------------------------------------
echo "[*] Checking IAM Access Analyser organisation coverage..."

org_analysers=$(aws accessanalyzer list-analyzers \
  --query 'analyzers[?type==`ORGANIZATION`] | length(@)' \
  --output text 2>/dev/null || echo "0")

if [ "${org_analysers:-0}" = "0" ]; then
  emit_finding "MEDIUM" "SECURITY_EXPOSURE" "organisation-level" \
    "No IAM Access Analyser with ORGANIZATION scope is configured - cross-account and cross-service resource exposure is not being automatically detected" \
    "OrgScopeAnalysers=0" \
    "Create an organisation-scoped analyser: aws accessanalyzer create-analyzer --analyzer-name org-analyser --type ORGANIZATION"
fi

echo "[*] Organisations risk scan complete. Findings written to ${OUTPUT_FILE}"
EOF
chmod +x risk-service/organisations-risk.sh

15. Identity Blast Radius and Production Takeover

If there is a single category that the original exposure model underweights, it is this one. Modern AWS incidents are identity-first, not infrastructure-first. The attacker rarely exploits a kernel vulnerability; they find a credential and then walk the identity graph until they own the account. Sysdig’s November 2025 case study traced exactly this path: a credential in a public S3 bucket, privilege escalation through Lambda code injection, and lateral movement across nineteen principals, all in under ten minutes. Aqua’s research into default IAM roles showed that services such as SageMaker, Glue, and EMR silently create roles with overly broad permissions, introducing privilege escalation paths that nobody chose. And in March 2026, AWS itself added UpdateAssumeRolePolicy to its tracked threat techniques, because attackers increasingly modify the trust policy of an existing role to add a principal they control, which creates a takeover path with no new role and no new policy to alert on.

This means the identity detector cannot simply list overpermissioned roles. It must trace the blast radius: whether any non-break-glass identity can reach administrative control indirectly, modify the guardrails that are supposed to contain it, or pass a privileged role to compute that it controls. It must also find the machine-identity weaknesses that turn a leaked key into a standing foothold, and the recovery-denial permissions that let one identity destroy the means of recovering from its own actions.
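
Policies often grant dangerous actions through wildcards such as iam:Put* rather than by exact name, so tracing blast radius depends on wildcard-aware matching. The following is a minimal sketch of that matching step only, assuming policy action patterns arrive one per line; the function names action_matches and grants_action are hypothetical and not part of the scripts in this article.

```shell
#!/usr/bin/env bash
# Sketch: wildcard-aware, case-insensitive matching of IAM action patterns.
# Function names are illustrative assumptions, not part of any AWS tooling.

# action_matches PATTERN ACTION
# Returns 0 when PATTERN (as written in a policy, e.g. "iam:Put*") covers ACTION.
action_matches() {
  local pattern action
  # IAM action names are case-insensitive, so compare in lower case.
  pattern=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  action=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  # The unquoted right-hand side makes [[ ]] treat * and ? as glob wildcards,
  # which mirrors the two wildcards IAM accepts in action names.
  [[ "$action" == $pattern ]]
}

# grants_action ACTION  (policy action patterns are read from stdin, one per line)
# Returns 0 if any pattern covers ACTION.
grants_action() {
  local wanted="$1" pattern
  while IFS= read -r pattern; do
    [ -z "$pattern" ] && continue
    action_matches "$pattern" "$wanted" && return 0
  done
  return 1
}
```

For example, `printf 'ec2:Describe*\niam:*\n' | grants_action iam:UpdateAssumeRolePolicy` succeeds because iam:* covers the action. The sketch only looks at Allow-style action patterns; Deny statements, NotAction, resource scoping, and conditions would still need to be evaluated separately.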

cat > risk-identity/identity-blast-radius.sh << 'EOF'
#!/usr/bin/env bash
# identity-blast-radius.sh - Identity takeover and blast radius detection
# Requires: aws cli v2, jq
# Answers: can one identity reach admin, disable guardrails, or deny recovery?

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/identity-findings.txt"
mkdir -p findings

emit() {
  # severity|risk_type|account|region|service|resource|finding|evidence|abuse_path|business_loss|detect|contain|recover|recommendation
  echo "$1|$2|${ACCOUNT}|${REGION}|identity|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Identity Blast Radius Scan: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"

# Permissions that allow indirect escalation to admin
ESCALATION_ACTIONS=(
  "iam:CreatePolicyVersion" "iam:SetDefaultPolicyVersion" "iam:AttachUserPolicy"
  "iam:AttachRolePolicy" "iam:PutUserPolicy" "iam:PutRolePolicy"
  "iam:CreateAccessKey" "iam:UpdateAssumeRolePolicy" "iam:PassRole"
  "lambda:UpdateFunctionCode" "lambda:CreateFunction" "sts:AssumeRole"
)

# Permissions that disable the guardrails meant to contain the identity
GUARDRAIL_ACTIONS=(
  "organizations:DetachPolicy" "organizations:DeletePolicy" "organizations:LeaveOrganization"
  "guardduty:DeleteDetector" "guardduty:UpdateDetector" "config:StopConfigurationRecorder"
  "cloudtrail:StopLogging" "cloudtrail:DeleteTrail" "securityhub:DisableSecurityHub"
)

# Permissions that deny recovery - destroying the means of getting back
RECOVERY_DENIAL_ACTIONS=(
  "backup:DeleteBackupVault" "backup:DeleteRecoveryPoint" "kms:ScheduleKeyDeletion"
  "kms:DisableKey" "rds:DeleteDBClusterSnapshot" "ec2:DeleteSnapshot"
)

evaluate_role() {
  local role_name="$1"
  # Skip break glass roles
  echo "$role_name" | grep -qiE "break.?glass|emergency" && return

  # Gather all actions from attached managed policies and inline policies
  local all_actions=""

  # Inline policies
  for pname in $(aws iam list-role-policies --role-name "$role_name" --query 'PolicyNames' --output json 2>/dev/null | jq -r '.[]'); do
    doc=$(aws iam get-role-policy --role-name "$role_name" --policy-name "$pname" --query 'PolicyDocument' --output json 2>/dev/null)
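    # Statement may be a single object or an array; normalise so jq does not abort the scan
    actions=$(echo "$doc" | jq -r '[(.Statement | if type=="array" then .[] else . end) | select(.Effect=="Allow") | .Action] | flatten | .[]? | strings' 2>/dev/null || true)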
    all_actions="${all_actions}
${actions}"
  done

  # Attached managed policies - check for AdministratorAccess directly
  for parn in $(aws iam list-attached-role-policies --role-name "$role_name" --query 'AttachedPolicies[*].PolicyArn' --output json 2>/dev/null | jq -r '.[]'); do
    if echo "$parn" | grep -q "AdministratorAccess"; then
      all_actions="${all_actions}
*"
    fi
    # Pull customer managed policy actions
    if echo "$parn" | grep -q "arn:aws:iam::${ACCOUNT}:policy"; then
      ver=$(aws iam get-policy --policy-arn "$parn" --query 'Policy.DefaultVersionId' --output text 2>/dev/null)
      doc=$(aws iam get-policy-version --policy-arn "$parn" --version-id "$ver" --query 'PolicyVersion.Document' --output json 2>/dev/null)
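      # Statement may be a single object or an array; normalise so jq does not abort the scan
      actions=$(echo "$doc" | jq -r '[(.Statement | if type=="array" then .[] else . end) | select(.Effect=="Allow") | .Action] | flatten | .[]? | strings' 2>/dev/null || true)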
      all_actions="${all_actions}
${actions}"
    fi
  done

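  has_action() {
    # Matches "*", the exact action, or "service:*" (case-insensitive, as IAM is).
    # Partial wildcards such as iam:Put* are not expanded here.
    # A here-string avoids a SIGPIPE false negative under pipefail.
    grep -qiE "^(\*|${1}|${1%%:*}:\*)$" <<< "$all_actions"
  }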

  # 1. Indirect path to admin
  for a in "${ESCALATION_ACTIONS[@]}"; do
    if has_action "$a"; then
      emit "CRITICAL" "PRIVILEGE_ESCALATION" "$role_name" \
        "Role can reach administrative control indirectly via ${a}" \
        "Action=${a}" \
        "Compromise role -> ${a} -> attach admin or assume privileged role -> full account control" \
        "Total loss of account control; attacker can perform any action including destroying the business" \
        "8" "45" "180" \
        "Remove ${a} or constrain it with resource and condition limits; PassRole and UpdateFunctionCode are the highest priority"
      break
    fi
  done

  # 2. Can disable guardrails
  for a in "${GUARDRAIL_ACTIONS[@]}"; do
    if has_action "$a"; then
      emit "CRITICAL" "GUARDRAIL_REMOVAL" "$role_name" \
        "Role can disable a security guardrail via ${a}" \
        "Action=${a}" \
        "Compromise role -> ${a} -> blind detection and remove containment before acting" \
        "Attacker operates undetected; incident response loses its evidence and controls" \
        "120" "240" "480" \
        "Move guardrail-disabling permissions behind an SCP and a break glass role only"
      break
    fi
  done

  # 3. Recovery denial
  for a in "${RECOVERY_DENIAL_ACTIONS[@]}"; do
    if has_action "$a"; then
      emit "CRITICAL" "RECOVERY_DENIAL" "$role_name" \
        "Role can destroy the means of recovery via ${a}" \
        "Action=${a}" \
        "Compromise role -> ${a} -> delete backups or keys -> recovery becomes impossible" \
        "Permanent, unrecoverable data loss; ransomware leverage with no clean restore path" \
        "120" "720" "5760" \
        "Deny recovery-destruction actions for all non break glass identities via SCP"
      break
    fi
  done
}

echo "[*] Evaluating role blast radius (this can take time on large accounts)..."
for role in $(aws iam list-roles --query 'Roles[*].RoleName' --output json | jq -r '.[]'); do
  # Skip AWS service-linked roles
  echo "$role" | grep -q "^AWSServiceRoleFor" && continue
  evaluate_role "$role"
done

# 4. Machine identity abuse - IAM users with active keys
echo "[*] Checking machine identity weaknesses..."
for user in $(aws iam list-users --query 'Users[*].UserName' --output json | jq -r '.[]'); do
  aws iam list-access-keys --user-name "$user" --query 'AccessKeyMetadata[?Status==`Active`]' --output json | jq -c '.[]' | while IFS= read -r key; do
    kid=$(echo "$key" | jq -r '.AccessKeyId')
    last_used=$(aws iam get-access-key-last-used --access-key-id "$kid" --query 'AccessKeyLastUsed.LastUsedDate' --output text 2>/dev/null || echo "None")
    emit "HIGH" "MACHINE_IDENTITY" "${user}/${kid}" \
      "IAM user has a long-lived active access key (last used: ${last_used})" \
      "AccessKeyId=${kid},LastUsed=${last_used}" \
      "Key leaks via code, logs, or laptop -> standing credential with no expiry -> persistent foothold" \
      "Persistent unauthorized access that survives password changes and session expiry" \
      "120" "30" "60" \
      "Replace with short-lived role credentials; if a key is unavoidable, rotate on a schedule and scope tightly"
  done
done

# 5. Roles trusted by external accounts without conditions
echo "[*] Checking cross-account trust relationships..."
for role in $(aws iam list-roles --query 'Roles[*].RoleName' --output json | jq -r '.[]'); do
  echo "$role" | grep -q "^AWSServiceRoleFor" && continue
  trust=$(aws iam get-role --role-name "$role" --query 'Role.AssumeRolePolicyDocument' --output json 2>/dev/null)

  # External AWS account principals
  # Principal may be "*", a Service, or Federated; only object principals with an AWS key are inspected
  ext=$(echo "$trust" | jq -r --arg acct "$ACCOUNT" '
    [.Statement[]? | select(.Effect=="Allow") | .Principal | objects | .AWS // empty] | flatten | .[] |
    strings | select(test("arn:aws:iam::") and (test($acct) | not))' 2>/dev/null || true)
  if [ -n "$ext" ]; then
    has_condition=$(echo "$trust" | jq -r '[.Statement[]? | select(.Principal | objects | .AWS) | has("Condition")] | any')
    if [ "$has_condition" != "true" ]; then
      emit "HIGH" "EXTERNAL_TRUST" "$role" \
        "Role is assumable by an external AWS account with no condition constraints" \
        "ExternalPrincipal=$(echo "$ext" | tr '\n' ' ')" \
        "External account compromised -> assume this role -> pivot into your account" \
        "Third-party breach becomes your breach with no additional barrier" \
        "240" "60" "120" \
        "Add an ExternalId condition and scope the trust to specific principals"
    fi
  fi

  # GitHub OIDC roles without repo constraints
  oidc=$(echo "$trust" | jq -r '[.Statement[]? | select(.Principal | objects | (.Federated // "") | tostring | test("token.actions.githubusercontent.com"))] | length')
  if [ "${oidc:-0}" -gt 0 ] 2>/dev/null; then
    sub_constrained=$(echo "$trust" | jq -r '[.Statement[]? | .Condition.StringLike."token.actions.githubusercontent.com:sub" // .Condition.StringEquals."token.actions.githubusercontent.com:sub" // empty] | length')
    if [ "${sub_constrained:-0}" = "0" ]; then
      emit "CRITICAL" "EXTERNAL_TRUST" "$role" \
        "GitHub OIDC role has no repository or branch constraint on the subject claim" \
        "OIDCProvider=github,SubConstraint=none" \
        "Any GitHub repo using AWS OIDC -> assume this role -> deploy into your account" \
        "Any public or compromised repository can obtain your AWS credentials" \
        "180" "45" "120" \
        "Constrain the sub claim to repo:ORG/REPO:ref:refs/heads/main or similar"
    fi
  fi
done

echo "[*] Identity blast radius scan complete."
EOF
chmod +x risk-identity/identity-blast-radius.sh

16. Destructive Capability Testing

Detecting dangerous state is necessary but not sufficient, because state tells you a capability exists while it says nothing about how quickly that capability turns into a smoking hole. The evolution here is to test the dangerous action rather than only observe the dangerous configuration. The question is not “is deletion protection off” but “if this identity were compromised right now, could the attacker destroy production before anyone could stop them, and how long would we then take to bring it back.”

This is read-only testing: the script never performs the destructive actions. It uses the IAM policy simulator to evaluate whether each action would be allowed for a given principal, and it pairs each simulated destruction capability with a recovery window. In the script below, the RDS window is derived from the actual backup retention state, while the other windows are fixed planning estimates that should be replaced with measured restore times. The output is deliberately blunt because it is written for executives: a single field that says whether production can be deleted, and a single field that says how long recovery would take. AWS’s own privilege escalation documentation confirms that the chaining of valid actions is the real attack, which is why simulating the action is more honest than auditing the setting.
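
The two executive fields can be derived from the pipe-delimited findings themselves. The sketch below assumes the 14-field layout documented in the emit helper comment earlier (field 6 is the resource, field 13 is recovery minutes) and the convention that 999999 means unrecoverable; the function name summarise_destructive is a hypothetical helper, not part of the scripts in this article.

```shell
#!/usr/bin/env bash
# Sketch: collapse destructive-capability findings into one line per principal.
# Assumes the 14-field pipe-delimited layout of the emit helper:
#   field 2 = risk type, field 6 = resource (principal), field 13 = recovery minutes.

# summarise_destructive  (findings are read from stdin)
# Prints: principal|DELETE_PROD_CAPABLE=TRUE|RECOVERY=<Nh or PERMANENT>
summarise_destructive() {
  awk -F'|' '
    $2 == "DESTRUCTIVE_CAPABILITY" {
      mins = $13 + 0
      # Keep the worst (longest) recovery window seen for each principal.
      if (!($6 in worst) || mins > worst[$6]) worst[$6] = mins
    }
    END {
      for (p in worst) {
        if (worst[p] >= 999999) r = "PERMANENT"
        else r = int(worst[p] / 60) "h"
        printf "%s|DELETE_PROD_CAPABLE=TRUE|RECOVERY=%s\n", p, r
      }
    }
  ' | sort
}
```

A typical call would be `summarise_destructive < findings/destructive-findings.txt`. The worst case is reported deliberately: a principal that can cause one permanent loss is summarised as PERMANENT even if its other capabilities are recoverable.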

cat > risk-destructive/destructive-capability.sh << 'EOF'
#!/usr/bin/env bash
# destructive-capability.sh - Can an attacker destroy production, and how fast?
# Requires: bash 4+ (associative arrays), aws cli v2, jq. Uses iam:SimulatePrincipalPolicy (read only).

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/destructive-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|destructive|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Destructive Capability Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# The destructive actions that define "can destroy production"
declare -A DESTROY_ACTIONS=(
  ["ec2:TerminateInstances"]="terminate all production EC2 instances"
  ["rds:DeleteDBCluster"]="delete the Aurora production cluster"
  ["rds:DeleteDBInstance"]="delete production RDS databases"
  ["route53:ChangeResourceRecordSets"]="repoint or delete production DNS records"
  ["eks:DeleteCluster"]="destroy production EKS clusters"
  ["kms:ScheduleKeyDeletion"]="schedule deletion of production encryption keys"
  ["s3:DeleteBucket"]="delete production S3 buckets"
)

# Estimate recovery window for a given destruction type from real state
recovery_window_mins() {
  case "$1" in
    rds:DeleteDBCluster|rds:DeleteDBInstance)
      # Worst case wins: if any instance has automated backups disabled (retention 0),
      # treat the loss as permanent; otherwise recovery is a matter of hours
      retention=$(aws rds describe-db-instances --query 'min(DBInstances[].BackupRetentionPeriod)' --output text 2>/dev/null || echo "0")
      [ "${retention:-0}" = "0" ] && echo "999999" || echo "540"   # 9h if recoverable
      ;;
    kms:ScheduleKeyDeletion) echo "999999" ;;   # cryptographic loss is effectively permanent
    route53:ChangeResourceRecordSets) echo "120" ;;
    eks:DeleteCluster) echo "480" ;;
    ec2:TerminateInstances) echo "240" ;;
    s3:DeleteBucket) echo "720" ;;
    *) echo "360" ;;
  esac
}

# Evaluate a single principal against the destruction action set
test_principal() {
  local arn="$1" name="$2"
  for action in "${!DESTROY_ACTIONS[@]}"; do
    decision=$(aws iam simulate-principal-policy \
      --policy-source-arn "$arn" \
      --action-names "$action" \
      --query 'EvaluationResults[0].EvalDecision' \
      --output text 2>/dev/null || echo "error")

    if [ "$decision" = "allowed" ]; then
      window=$(recovery_window_mins "$action")
      if [ "$window" = "999999" ]; then
        loss="PERMANENT, unrecoverable - no clean restore path exists"
        sev="CRITICAL"
        recover="999999"
      else
        loss="Production down until restore completes (~$((window/60))h recovery window)"
        sev="CRITICAL"
        recover="$window"
      fi
      emit "$sev" "DESTRUCTIVE_CAPABILITY" "$name" \
        "Principal CAN ${DESTROY_ACTIONS[$action]} (DELETE_PROD_CAPABLE=TRUE)" \
        "Action=${action},SimulatedDecision=allowed" \
        "Compromise this principal -> ${action} -> production destroyed" \
        "$loss" \
        "10" "60" "$recover" \
        "Remove ${action} from this principal or require MFA and break glass approval"
    fi
  done
}

echo "[*] Simulating destruction capability for human-assumable roles..."
for role in $(aws iam list-roles --query 'Roles[*].{Name:RoleName,Arn:Arn}' --output json | jq -c '.[]'); do
  rname=$(echo "$role" | jq -r '.Name')
  rarn=$(echo "$role" | jq -r '.Arn')
  echo "$rname" | grep -q "^AWSServiceRoleFor" && continue
  echo "$rname" | grep -qiE "break.?glass|emergency" && continue
  test_principal "$rarn" "$rname"
done

echo "[*] Destructive capability scan complete."
EOF
chmod +x risk-destructive/destructive-capability.sh

17. Detection Blindness

Many organisations are secure right up until the moment they become invisible. An attacker who can switch off the logs, shorten retention, or delete the audit account does not need to be quiet, because there is no longer anyone watching. This is why detection blindness deserves its own family rather than being folded into the observability service checks. The service checks tell you whether logging is on today. The blindness checks tell you whether an attacker can turn it off tomorrow and how long you would operate in the dark before noticing.

Recent abuse research makes this concrete. Sysdig’s LLMjacking analysis found that attackers actively call DeleteModelInvocationLoggingConfiguration to suppress Bedrock logging, deliberately avoid credentials that have logging enabled, and prefer the Converse API precisely because its actions did not automatically appear in CloudTrail. The lesson is that suppression of evidence is now a standard early step in the attack, not an afterthought, and a risk programme that cannot detect the ability to go dark is measuring the wrong thing.
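
Because suppression is an early step rather than an afterthought, it can help to see every suppression action a principal holds, not just the first one found. The following is a minimal sketch under that assumption: allowed_actions takes the name of a decision command so the simulator call can be swapped for a stub when testing offline. The names allowed_actions, simulate_decision, and stub_decision are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list every action a principal is allowed, using an injectable decision command.

# allowed_actions DECIDE_CMD PRINCIPAL_ARN ACTION...
# Runs "DECIDE_CMD PRINCIPAL_ARN ACTION" for each action and prints a
# comma-separated list of the actions for which it answered "allowed".
allowed_actions() {
  local decide="$1" arn="$2" action out=""
  shift 2
  for action in "$@"; do
    if [ "$("$decide" "$arn" "$action")" = "allowed" ]; then
      out="${out:+${out},}${action}"
    fi
  done
  printf '%s\n' "$out"
}

# Live decision source: the IAM policy simulator (needs AWS credentials).
simulate_decision() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" --action-names "$2" \
    --query 'EvaluationResults[0].EvalDecision' --output text 2>/dev/null || echo "error"
}

# Offline stub for demonstration only: pretends every cloudtrail action is allowed.
stub_decision() {
  case "$2" in
    cloudtrail:*) echo "allowed" ;;
    *) echo "implicitDeny" ;;
  esac
}
```

With the stub, `allowed_actions stub_decision arn:aws:iam::111122223333:role/demo cloudtrail:StopLogging guardduty:DeleteDetector` prints only the CloudTrail action; passing simulate_decision instead would query the real simulator. The account ID and role name here are placeholders.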

cat > risk-observability/detection-blindness.sh << 'EOF'
#!/usr/bin/env bash
# detection-blindness.sh - Can an attacker hide by disabling detection?
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/blindness-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|observability|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Detection Blindness Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# Who can blind the environment? Test the suppression actions against all roles.
BLINDING_ACTIONS=(
  "cloudtrail:StopLogging" "cloudtrail:DeleteTrail" "cloudtrail:UpdateTrail"
  "config:StopConfigurationRecorder" "config:DeleteConfigurationRecorder"
  "guardduty:DeleteDetector" "guardduty:UpdateDetector"
  "securityhub:DisableSecurityHub"
  "logs:DeleteLogGroup" "logs:PutRetentionPolicy"
  "bedrock:DeleteModelInvocationLoggingConfiguration"
)

echo "[*] Testing which principals can suppress detection..."
for role in $(aws iam list-roles --query 'Roles[*].{Name:RoleName,Arn:Arn}' --output json | jq -c '.[]'); do
  rname=$(echo "$role" | jq -r '.Name')
  rarn=$(echo "$role" | jq -r '.Arn')
  echo "$rname" | grep -q "^AWSServiceRoleFor" && continue
  echo "$rname" | grep -qiE "break.?glass|emergency|securityaudit|incident" && continue

  for action in "${BLINDING_ACTIONS[@]}"; do
    decision=$(aws iam simulate-principal-policy \
      --policy-source-arn "$rarn" --action-names "$action" \
      --query 'EvaluationResults[0].EvalDecision' --output text 2>/dev/null || echo "error")
    if [ "$decision" = "allowed" ]; then
      emit "CRITICAL" "DETECTION_BLINDNESS" "$rname" \
        "Principal can suppress detection via ${action}" \
        "Action=${action},SimulatedDecision=allowed" \
        "Compromise principal -> ${action} -> operate without generating evidence" \
        "Incident becomes undetectable and uninvestigable; breach dwell time extends indefinitely" \
        "1440" "2880" "2880" \
        "Deny detection-suppression actions for all non break glass roles via SCP; alert on any successful call"
      break
    fi
  done
done

# CloudWatch Log retention set too short on security-relevant groups
echo "[*] Checking log retention windows..."
aws logs describe-log-groups --query 'logGroups[*].{Name:logGroupName,Retention:retentionInDays}' --output json | jq -c '.[]' | while IFS= read -r lg; do
  name=$(echo "$lg" | jq -r '.Name')
  retention=$(echo "$lg" | jq -r '.Retention // 0')
  # Only flag security-relevant log groups
  echo "$name" | grep -qiE "cloudtrail|guardduty|config|vpc|flow|audit|security" || continue
  if [ "$retention" != "0" ] && [ "$retention" -lt 90 ] 2>/dev/null; then
    emit "MEDIUM" "DETECTION_BLINDNESS" "$name" \
      "Security-relevant log group has only ${retention} day retention" \
      "RetentionInDays=${retention}" \
      "Slow attack or delayed discovery -> evidence already aged out before investigation begins" \
      "Forensic timeline cannot be reconstructed; breach scope becomes guesswork" \
      "0" "0" "0" \
      "Set retention to at least 365 days for audit and detection log groups"
  fi
done

echo "[*] Detection blindness scan complete."
EOF
chmod +x risk-observability/detection-blindness.sh

18. Data Destruction and Encryption Hostage Risk

This category causes existential damage and it is the one where the difference between exposure and loss is starkest. “Backups enabled equals true” is an exposure statement. It tells you nothing about whether the backups live in the same account that an attacker would compromise, whether the vault is immutable, whether the encryption key can be deleted out from under the data, or whether the replica region is genuinely isolated. The right measure is the number of minutes from compromise to an irrecoverable state, because that single number tells the board whether a ransomware event is survivable or terminal.

The detector evaluates the recovery topology rather than the recovery configuration. It checks whether backups can be reached and deleted from the production account, whether vault lock is in force, whether the KMS keys protecting backups can be scheduled for deletion, and whether the disaster recovery region shares the blast radius of the primary. The encryption hostage scenario is specifically the one where an attacker does not exfiltrate data at all but simply schedules the deletion of the key, holding the still-present but now unreadable data to ransom.
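
Whether a backup copy shares the primary blast radius depends on both the account and the region of the destination vault, and both can be read from the vault ARN. The following is a small sketch of that classification, assuming the standard backup vault ARN shape arn:aws:backup:REGION:ACCOUNT:backup-vault:NAME; the function name copy_isolation and its output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: classify how isolated a backup copy destination is from its source.

# copy_isolation DEST_VAULT_ARN SOURCE_ACCOUNT SOURCE_REGION
# Expects an ARN shaped like arn:aws:backup:REGION:ACCOUNT:backup-vault:NAME.
copy_isolation() {
  local arn="$1" src_account="$2" src_region="$3"
  local partition service dest_region dest_account rest
  # Split the ARN on colons; the first field is the literal "arn".
  IFS=: read -r _ partition service dest_region dest_account rest <<< "$arn"
  if [ "$service" != "backup" ] || [ -z "$dest_region" ] || [ -z "$dest_account" ]; then
    echo "unknown"
    return 0
  fi
  if [ "$dest_account" != "$src_account" ] && [ "$dest_region" != "$src_region" ]; then
    echo "cross-account-cross-region"
  elif [ "$dest_account" != "$src_account" ]; then
    echo "cross-account-same-region"
  elif [ "$dest_region" != "$src_region" ]; then
    echo "same-account-cross-region"
  else
    echo "same-account-same-region"
  fi
}
```

A copy classified as same-account-cross-region survives a regional outage but not an account compromise, which is the distinction between exposure and loss that this category is trying to measure. A different account ID alone does not prove independent governance; that still has to be confirmed against who can administer the destination account.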

cat > risk-data/data-destruction.sh << 'EOF'
#!/usr/bin/env bash
# data-destruction.sh - Minutes to irrecoverable state, not "backups enabled"
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/data-destruction-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|data|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Data Destruction Risk Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# 1. Backup vaults in the same account as production (no isolation)
echo "[*] Checking backup vault isolation and lock..."
aws backup list-backup-vaults --query 'BackupVaultList[*].BackupVaultName' --output json | jq -r '.[]' | while IFS= read -r vault; do
  details=$(aws backup describe-backup-vault --backup-vault-name "$vault" --output json 2>/dev/null || echo '{}')
  locked=$(echo "$details" | jq -r '.Locked // false')
  if [ "$locked" != "true" ]; then
    emit "CRITICAL" "IRRECOVERABLE_LOSS" "$vault" \
      "Backup vault is not locked and lives in the production account" \
      "Locked=false,Account=${ACCOUNT}" \
      "Compromise account -> backup:DeleteRecoveryPoint -> backups and production destroyed together" \
      "Ransomware with no clean restore; data loss is permanent" \
      "120" "60" "999999" \
      "Enable Vault Lock in compliance mode and copy recovery points to an isolated account"
  fi
done

# 2. KMS keys protecting data that can be scheduled for deletion (encryption hostage)
echo "[*] Checking for encryption hostage exposure..."
for key in $(aws kms list-keys --query 'Keys[*].KeyId' --output json | jq -r '.[]'); do
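  # Fall back to an empty object so one inaccessible key does not abort the scan under set -e
  meta=$(aws kms describe-key --key-id "$key" --output json 2>/dev/null || echo '{}')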
  manager=$(echo "$meta" | jq -r '.KeyMetadata.KeyManager')
  state=$(echo "$meta" | jq -r '.KeyMetadata.KeyState')
  [ "$manager" = "AWS" ] && continue
  if [ "$state" = "Enabled" ]; then
    alias=$(aws kms list-aliases --key-id "$key" --query 'Aliases[0].AliasName' --output text 2>/dev/null || echo "no-alias")
    emit "HIGH" "ENCRYPTION_HOSTAGE" "${alias}(${key})" \
      "Customer managed key is deletable; data encrypted with it becomes unreadable if the key is scheduled for deletion" \
      "KeyState=Enabled,Deletable=true" \
      "Compromise identity with kms:ScheduleKeyDeletion -> schedule key deletion -> data present but permanently unreadable" \
      "Encryption hostage: attacker does not need to exfiltrate, only to delete the key" \
      "60" "30" "999999" \
      "Restrict kms:ScheduleKeyDeletion via key policy and SCP; consider multi-Region keys with separate custody"
  fi
done

# 3. Disaster recovery region sharing the primary blast radius
echo "[*] Checking cross-region recovery isolation..."
plans=$(aws backup list-backup-plans --query 'BackupPlansList[*].BackupPlanId' --output json | jq -r '.[]')
cross_region_copy=0
for pid in $plans; do
  dests=$(aws backup get-backup-plan --backup-plan-id "$pid" --query 'BackupPlan.Rules[*].CopyActions[*].DestinationBackupVaultArn' --output json 2>/dev/null | jq -r '.[][]?' || true)
  for d in $dests; do
    dest_region=$(echo "$d" | cut -d: -f4)
    [ "$dest_region" != "$REGION" ] && cross_region_copy=$((cross_region_copy+1))
  done
done
if [ "$cross_region_copy" = "0" ]; then
  emit "HIGH" "IRRECOVERABLE_LOSS" "account-level" \
    "No backup copy crosses into a second region; the DR region shares the primary blast radius" \
    "CrossRegionCopies=0" \
    "Region-wide event or account compromise -> single copy destroyed -> no surviving recovery point" \
    "A single regional failure or account breach is unrecoverable" \
    "120" "240" "5760" \
    "Add cross-region copy actions to backup plans and verify the destination is independently governed"
fi

echo "[*] Data destruction risk scan complete."
EOF
chmod +x risk-data/data-destruction.sh

19. Supply Chain and Pipeline Compromise

This is probably the fastest growing cloud risk category, and it is almost entirely absent from configuration-led audits because the danger does not live in a resource setting. It lives in the relationship between the CI system and production. A pipeline that can deploy directly to production, with no approval gate, mutable artifacts, and long-lived production credentials sitting in its environment, is a single compromise away from shipping malware to every customer under your own valid signature. Signed builds and provenance do not save you here, because a compromised pipeline produces validly signed malicious artifacts.

The detector focuses on the blast radius of the build system rather than the security of any individual repository. It looks for the deployment roles that CI can assume, evaluates whether those roles can reach production directly, checks for overly broad OIDC trust as covered in the identity family, and flags the absence of approval gates between build and production deploy. The GitHub OIDC subject-constraint check in the identity family and this pipeline family are deliberately complementary: one finds the trust weakness, the other finds what that weakness reaches.
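
How far an OIDC trust weakness reaches depends on how tightly the subject claim is pinned, so the presence of a sub condition alone says little. The sketch below grades a condition value, assuming GitHub's default subject format repo:OWNER/REPO:FILTER (for example repo:ORG/REPO:ref:refs/heads/main); the function name classify_github_sub and its labels are hypothetical, and customised subject formats are reported as unrecognised.

```shell
#!/usr/bin/env bash
# Sketch: grade how tightly a token.actions.githubusercontent.com:sub condition
# value constrains which GitHub workflows may assume a role.

# classify_github_sub VALUE
# Prints one of: none, org-wide, repo-wide, partial, pinned, unrecognised.
classify_github_sub() {
  local sub="$1" body owner rest repo filter
  case "$sub" in
    ""|"*") echo "none"; return 0 ;;
    repo:*) body="${sub#repo:}" ;;
    *) echo "unrecognised"; return 0 ;;
  esac
  owner="${body%%/*}"                      # text before the first slash
  case "$owner" in *[*?]*) echo "none"; return 0 ;; esac
  if [ "$owner" = "$body" ]; then echo "unrecognised"; return 0; fi
  rest="${body#*/}"                        # text after the first slash
  repo="${rest%%:*}"                       # repository name before the first colon
  case "$repo" in *[*?]*) echo "org-wide"; return 0 ;; esac
  if [ "$repo" = "$rest" ]; then echo "unrecognised"; return 0; fi
  filter="${rest#*:}"                      # ref, environment, or pull_request part
  case "$filter" in
    "*") echo "repo-wide" ;;
    *[*?]*) echo "partial" ;;
    *) echo "pinned" ;;
  esac
}
```

A value graded none or org-wide lets workflows from many repositories obtain credentials, while pinned restricts the role to one ref or environment of one repository. The grade describes the trust weakness only; what the role can then reach still has to be established separately.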

cat > risk-supply-chain/pipeline-compromise.sh << 'EOF'
#!/usr/bin/env bash
# pipeline-compromise.sh - What can a compromised CI/CD pipeline reach?
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/supply-chain-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|supply-chain|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Pipeline Compromise Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# Identify likely deployment/CI roles by name or trust relationship
echo "[*] Identifying deployment and CI roles..."
for role in $(aws iam list-roles --query 'Roles[*].{Name:RoleName,Arn:Arn}' --output json | jq -c '.[]'); do
  rname=$(echo "$role" | jq -r '.Name')
  rarn=$(echo "$role" | jq -r '.Arn')
  echo "$rname" | grep -q "^AWSServiceRoleFor" && continue

  trust=$(aws iam get-role --role-name "$rname" --query 'Role.AssumeRolePolicyDocument' --output json 2>/dev/null)
  is_ci=false
  echo "$trust" | jq -r '.Statement[]?.Principal.Federated // empty' | grep -qE "githubusercontent|gitlab|bitbucket" && is_ci=true
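  # Match "ci" and "cd" only as delimited tokens so names such as "capacity" or "pricing" are not swept in
  echo "$rname" | grep -qiE 'deploy|pipeline|codebuild|codepipeline|terraform|github|actions|(^|[-_./])(ci|cd|cicd)([-_./]|$)' && is_ci=true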
  [ "$is_ci" = "false" ] && continue

  # Can this CI role deploy directly to production compute or infra?
  for action in "lambda:UpdateFunctionCode" "ecs:UpdateService" "eks:UpdateClusterConfig" "cloudformation:UpdateStack" "ec2:RunInstances"; do
    decision=$(aws iam simulate-principal-policy --policy-source-arn "$rarn" --action-names "$action" \
      --query 'EvaluationResults[0].EvalDecision' --output text 2>/dev/null || echo "error")
    if [ "$decision" = "allowed" ]; then
      emit "CRITICAL" "SUPPLY_CHAIN" "$rname" \
        "CI/CD role can deploy directly to production via ${action} with no enforced approval gate" \
        "Action=${action},SimulatedDecision=allowed" \
        "Compromise CI -> ${action} -> ship malicious but validly signed artifact to production" \
        "Malware delivered to all customers under your own signature; full production compromise" \
        "4320" "2880" "10080" \
        "Insert a manual approval gate between build and prod deploy; split build and deploy roles"
      break
    fi
  done

  # Does the CI role hold standing admin or PassRole?
  for action in "iam:PassRole" "iam:AttachRolePolicy"; do
    decision=$(aws iam simulate-principal-policy --policy-source-arn "$rarn" --action-names "$action" \
      --query 'EvaluationResults[0].EvalDecision' --output text 2>/dev/null || echo "error")
    if [ "$decision" = "allowed" ]; then
      emit "CRITICAL" "SUPPLY_CHAIN" "$rname" \
        "CI/CD role can escalate privilege via ${action}" \
        "Action=${action}" \
        "Compromise CI -> ${action} -> grant itself admin -> own the account" \
        "Build system compromise becomes full account takeover" \
        "4320" "1440" "4320" \
        "Remove PassRole and policy-attachment permissions from pipeline roles"
      break
    fi
  done
done

# ECR repositories allowing mutable tags (artifact tampering)
echo "[*] Checking ECR image tag mutability..."
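# Fall back to an empty list so a denied or failed call does not abort the scan under pipefail
{ aws ecr describe-repositories --query 'repositories[*].{Name:repositoryName,Mutability:imageTagMutability}' --output json 2>/dev/null || echo '[]'; } | jq -c '.[]' | while IFS= read -r repo; do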
  name=$(echo "$repo" | jq -r '.Name')
  mut=$(echo "$repo" | jq -r '.Mutability')
  if [ "$mut" = "MUTABLE" ]; then
    emit "HIGH" "SUPPLY_CHAIN" "$name" \
      "ECR repository allows mutable image tags; a deployed tag can be overwritten with a different image" \
      "ImageTagMutability=MUTABLE" \
      "Compromise push credential -> overwrite :latest or release tag -> next deploy pulls malicious image" \
      "Tampered artifact reaches production silently; provenance is meaningless" \
      "2880" "720" "2880" \
      "Set image tag immutability to IMMUTABLE and enforce digest-pinned deploys"
  fi
done

echo "[*] Pipeline compromise scan complete."
EOF
chmod +x risk-supply-chain/pipeline-compromise.sh
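
Every loss-pathway detector writes its findings through the same emit() function, so the line format is worth seeing in structured form. The following is a minimal Python sketch, not part of the toolkit itself, that splits one such line into named fields. The field names follow the fourteen-field schema that score.py documents later; the sample account ID, region, and role name are hypothetical.

```python
# Minimal sketch: split one pipe-delimited emit() line into named fields.
# Field names follow the 14-field loss-pathway schema; the sample is made up.

FIELDS = (
    "severity", "risk_type", "account", "region", "service", "resource",
    "finding", "evidence", "abuse_path", "business_loss",
    "detect", "contain", "recover", "recommendation",
)


def parse_finding(line):
    """Return a dict for a 14-field finding line, or None for anything else."""
    parts = line.rstrip("\n").split("|")
    if len(parts) != len(FIELDS):
        return None  # banner line, or a line from the 9-field service schema
    rec = dict(zip(FIELDS, parts))
    # The three time estimates are minutes; non-numeric values become None
    for key in ("detect", "contain", "recover"):
        rec[key] = int(rec[key]) if rec[key].isdigit() else None
    return rec


# Hypothetical sample line (account ID, region and role name are illustrative)
SAMPLE = (
    "CRITICAL|SUPPLY_CHAIN|123456789012|eu-west-1|supply-chain|ci-deploy-role|"
    "CI/CD role can escalate privilege via iam:PassRole|Action=iam:PassRole|"
    "Compromise CI -> iam:PassRole -> grant itself admin -> own the account|"
    "Build system compromise becomes full account takeover|"
    "4320|1440|4320|"
    "Remove PassRole and policy-attachment permissions from pipeline roles"
)

if __name__ == "__main__":
    rec = parse_finding(SAMPLE)
    print(rec["resource"], rec["risk_type"], rec["recover"])
```

Because the format is positional, a finding text that itself contains a pipe character would shift every later field, which is one reason to keep free-text fields free of that character.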

20. Economic Denial of Service and Cloud Bankruptcy

Almost nobody audits this, and yet it is one of the few attacks that can end a company without ever touching a single byte of customer data. The cloud billing model means an attacker who gains a credential does not have to take anything down to do catastrophic harm; they can simply spend. The question that belongs in front of a CFO is direct: can an attacker burn the annual cloud budget in twenty-four hours, and where are the controls that would stop them?

The most acute version of this in 2025 and 2026 is LLMjacking. Sysdig’s research found that abusing Amazon Bedrock with stolen credentials can cost a victim over forty-six thousand dollars per day, rising to over one hundred thousand dollars per day with frontier models; the pattern is now categorised under OWASP’s LLM10 Unbounded Consumption risk. The broader pattern is the denial-of-wallet attack, which weaponises auto-scaling and pay-per-use pricing to inflict financial damage without affecting availability. That is precisely why standard availability monitoring never catches it. The detector inspects the limits that bound the maximum twenty-four-hour burn the current configuration permits and flags the absence of the controls that would cap it.

cat > risk-economic/economic-dos.sh << 'EOF'
#!/usr/bin/env bash
# economic-dos.sh - Maximum 24h burn; can an attacker bankrupt the account?
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/economic-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|economic|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Economic DoS Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# 1. Bedrock model invocation without budget guardrails (LLMjacking)
echo "[*] Checking Bedrock LLMjacking exposure..."
bedrock_logging=$(aws bedrock get-model-invocation-logging-configuration --query 'loggingConfig' --output json 2>/dev/null || echo "null")
if [ "$bedrock_logging" = "null" ] || [ -z "$bedrock_logging" ]; then
  emit "HIGH" "DENIAL_OF_WALLET" "bedrock" \
    "Bedrock model invocation logging is not configured; LLMjacking would be both possible and invisible" \
    "InvocationLogging=disabled" \
    "Stolen credential -> invoke frontier models at scale -> \$46k-\$100k per day with no log trail" \
    "Catastrophic, uncapped daily spend; documented real-world losses exceed \$46,000/day" \
    "1440" "240" "60" \
    "Enable Bedrock invocation logging and apply per-model and per-account spend guardrails"
fi

# 2. Service quotas left at defaults that permit runaway GPU / compute
echo "[*] Checking high-cost compute quotas..."
gpu_quota=$(aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-417A185B \
  --query 'Quota.Value' --output text 2>/dev/null || echo "unknown")
if [ "$gpu_quota" != "unknown" ] && [ "$(printf '%.0f' "$gpu_quota" 2>/dev/null || echo 0)" -gt 0 ] 2>/dev/null; then
  emit "MEDIUM" "DENIAL_OF_WALLET" "ec2-gpu-quota" \
    "Account permits on-demand GPU (P-family) instances; quota=${gpu_quota} vCPU" \
    "GPUvCPUQuota=${gpu_quota}" \
    "Compromise credential -> launch maximum GPU fleet -> burn cost at thousands of dollars per hour" \
    "Runaway GPU spend; cryptomining and model-training abuse are the common goals" \
    "720" "120" "60" \
    "Reduce GPU quotas to the minimum required and alert on RunInstances for P/G families"
fi

# 3. No AWS Budgets configured with alerts
echo "[*] Checking for budget guardrails..."
budget_count=$(aws budgets describe-budgets --account-id "$ACCOUNT" --query 'Budgets | length(@)' --output text 2>/dev/null || echo "0")
if [ "${budget_count:-0}" = "0" ]; then
  emit "HIGH" "DENIAL_OF_WALLET" "account-level" \
    "No AWS Budgets are configured; there is no automated signal when spend spikes abnormally" \
    "Budgets=0" \
    "Any cost-based attack -> spend climbs unnoticed -> discovered only on the monthly invoice" \
    "Financial damage runs unchecked for up to a full billing cycle" \
    "43200" "240" "0" \
    "Create AWS Budgets with SNS alerts on daily spend thresholds and enable AWS Cost Anomaly Detection"
fi

# 4. Account-level Lambda concurrency limit raised above 1000
echo "[*] Checking for unbounded Lambda concurrency..."
account_concurrency=$(aws lambda get-account-settings --query 'AccountLimit.ConcurrentExecutions' --output text 2>/dev/null || echo "unknown")
if [ "$account_concurrency" != "unknown" ] && [ "$account_concurrency" -gt 1000 ] 2>/dev/null; then
  emit "MEDIUM" "DENIAL_OF_WALLET" "lambda" \
    "Account Lambda concurrency limit is high (${account_concurrency}); a triggered function can scale into large invocation cost" \
    "ConcurrentExecutions=${account_concurrency}" \
    "Trigger a billable function in a loop -> auto-scale to the limit -> invocation-based denial of wallet" \
    "Serverless cost amplification without any availability signal" \
    "1440" "120" "30" \
    "Set per-function reserved concurrency limits and alarm on invocation-rate anomalies"
fi

echo "[*] Economic DoS scan complete."
EOF
chmod +x risk-economic/economic-dos.sh
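
The shell detector reports the quota but does not put a price on it. As a rough illustration of the arithmetic behind a maximum twenty-four-hour burn, the Python sketch below multiplies a vCPU quota through to a daily figure and compares it with an annual budget. Every number in it is a placeholder chosen for illustration: the instance size, the hourly price, and the budget are assumptions, not quoted AWS prices or values from any real account.

```python
# Minimal sketch: turn a vCPU quota into a worst-case 24h spend figure.
# All inputs below are illustrative placeholders, not real AWS prices.


def max_daily_burn(vcpu_quota, vcpus_per_instance, hourly_price_usd, hours=24):
    """Upper bound on spend if the whole quota runs flat out for `hours`."""
    instances = int(vcpu_quota) // vcpus_per_instance  # whole instances only
    return instances * hourly_price_usd * hours


def days_to_exhaust(annual_budget_usd, daily_burn_usd):
    """How many days of maximum burn it takes to consume the annual budget."""
    if daily_burn_usd <= 0:
        return float("inf")
    return annual_budget_usd / daily_burn_usd


# Hypothetical inputs: replace with the quota the detector reports and the
# current price of the largest instance type that quota permits.
QUOTA_VCPUS = 384
VCPUS_PER_INSTANCE = 96
HOURLY_PRICE_USD = 30.0
ANNUAL_BUDGET_USD = 1_000_000

if __name__ == "__main__":
    burn = max_daily_burn(QUOTA_VCPUS, VCPUS_PER_INSTANCE, HOURLY_PRICE_USD)
    print(f"max 24h burn: ${burn:,.0f}")
    print(f"days to exhaust budget: {days_to_exhaust(ANNUAL_BUDGET_USD, burn):.1f}")
```

The same calculation can be repeated per region and per service and summed, since quotas apply separately in each region.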

21. Fraud and Integrity Attacks

Security teams overfocus on confidentiality because data breaches make headlines, but many companies, and especially banks, do not die from a leak. They die from an integrity failure. An attacker who can quietly change payment routing, flip a feature flag, mutate an event stream, or alter a reconciliation job can move money or corrupt records without ever causing an outage and without triggering a single availability alert. The question that matters in a financial services context is uncomfortable but exact: can an attacker move money without anyone noticing, because nothing broke.

This family is necessarily environment-specific because the integrity-critical resources differ by business, but the detection principle is general. Find the control points where a small, valid-looking change has an outsized and silent financial or data-integrity consequence, and find who can change them. The script below covers the AWS-native control points (DNS routing, parameter and feature-flag stores, EventBridge rules, and the ability to disable scheduled reconciliation jobs) and is intended to be extended with the organisation’s own payment and ledger resources.

cat > risk-integrity/integrity-attack.sh << 'EOF'
#!/usr/bin/env bash
# integrity-attack.sh - Can an attacker move money or corrupt records silently?
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/integrity-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|integrity|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Integrity Attack Scan: account=${ACCOUNT} region=${REGION} ===" | tee "$OUTPUT_FILE"

# 1. SSM parameters that look like routing / feature-flag / config control points
echo "[*] Checking integrity-critical parameters..."
for keyword in routing payment feature flag config ledger reconcil settlement; do
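  # Fall back to an empty list so a denied or failed call does not abort the scan under pipefail
  { aws ssm describe-parameters \
    --parameter-filters "Key=Name,Option=Contains,Values=${keyword}" \
    --query 'Parameters[*].Name' --output json 2>/dev/null || echo '[]'; } | jq -r '.[]?' | while IFS= read -r pname; do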
      emit "HIGH" "INTEGRITY_ATTACK" "$pname" \
        "Integrity-critical parameter exists; a silent value change could alter routing, flags, or reconciliation behaviour" \
        "Parameter=${pname},Keyword=${keyword}" \
        "Compromise write access -> change parameter -> reroute payments or disable a check without any outage" \
        "Money moved or records corrupted with no availability signal; discovered only at reconciliation" \
        "2880" "480" "1440" \
        "Restrict write access, enable parameter change history alerting, and require dual control for these keys"
  done
done

# 2. EventBridge rules that can be disabled (breaking automated reconciliation)
echo "[*] Checking for disablable scheduled jobs..."
for bus in $(aws events list-event-buses --query 'EventBuses[*].Name' --output text 2>/dev/null); do
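  # Fall back to an empty list so a denied or failed call does not abort the scan under pipefail
  { aws events list-rules --event-bus-name "$bus" \
    --query 'Rules[?ScheduleExpression!=`null`].{Name:Name,State:State}' --output json 2>/dev/null || echo '[]'; } | jq -c '.[]?' | while IFS= read -r rule; do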
      rname=$(echo "$rule" | jq -r '.Name')
      echo "$rname" | grep -qiE "reconcil|settlement|audit|batch|report|sweep" || continue
      emit "MEDIUM" "INTEGRITY_ATTACK" "${bus}/${rname}" \
        "Scheduled rule appears to drive a reconciliation or settlement job and can be disabled by anyone with events:DisableRule" \
        "Rule=${rname},Bus=${bus}" \
        "Compromise -> events:DisableRule -> reconciliation silently stops -> fraud goes undetected" \
        "Detective control silently removed; integrity failures accumulate undiscovered" \
        "1440" "480" "720" \
        "Alert on DisableRule for reconciliation rules and protect them via a dedicated SCP"
  done
done

# 3. Who can change Route53 records (payment endpoint repointing)
echo "[*] Checking DNS change capability for integrity impact..."
for role in $(aws iam list-roles --query 'Roles[*].{Name:RoleName,Arn:Arn}' --output json | jq -c '.[]'); do
  rname=$(echo "$role" | jq -r '.Name')
  rarn=$(echo "$role" | jq -r '.Arn')
  echo "$rname" | grep -q "^AWSServiceRoleFor" && continue
  echo "$rname" | grep -qiE "break.?glass|dns|network" && continue
  decision=$(aws iam simulate-principal-policy --policy-source-arn "$rarn" \
    --action-names "route53:ChangeResourceRecordSets" \
    --query 'EvaluationResults[0].EvalDecision' --output text 2>/dev/null || echo "error")
  if [ "$decision" = "allowed" ]; then
    emit "HIGH" "INTEGRITY_ATTACK" "$rname" \
      "Non-network role can change DNS records; payment or API endpoints could be silently repointed" \
      "Action=route53:ChangeResourceRecordSets,Decision=allowed" \
      "Compromise role -> repoint payment endpoint to attacker infrastructure -> intercept transactions" \
      "Transactions intercepted or rerouted with the platform still appearing healthy" \
      "240" "120" "240" \
      "Restrict ChangeResourceRecordSets to a dedicated DNS role and alert on hosted-zone changes"
  fi
done

echo "[*] Integrity attack scan complete."
EOF
chmod +x risk-integrity/integrity-attack.sh
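
Because the parameter check queries once per keyword, a parameter whose name contains several keywords is reported once for each match. One way to collapse those repeats before scoring is sketched below in Python. It assumes the pipe-delimited line format written by emit(), with the risk type in the second field and the resource in the sixth; the parameter names in the example are hypothetical.

```python
# Minimal sketch: drop repeat findings for the same (risk_type, resource).
# Assumes emit()'s pipe-delimited layout; example names are hypothetical.


def dedupe_findings(lines):
    """Keep the first finding per (risk_type, resource); pass other lines through."""
    seen = set()
    unique = []
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) < 9:
            unique.append(line)  # banner or other non-finding line
            continue
        key = (parts[1], parts[5])  # risk_type, resource
        if key in seen:
            continue
        seen.add(key)
        unique.append(line)
    return unique


def make_line(resource, keyword):
    """Build an illustrative 14-field finding line for the example below."""
    return "|".join([
        "HIGH", "INTEGRITY_ATTACK", "123456789012", "eu-west-1", "integrity",
        resource, "Integrity-critical parameter exists",
        f"Parameter={resource},Keyword={keyword}",
        "abuse path", "business loss", "2880", "480", "1440", "recommendation",
    ])


EXAMPLE = [
    "=== Integrity Attack Scan ===",
    make_line("/prod/payment/routing-config", "routing"),
    make_line("/prod/payment/routing-config", "payment"),
    make_line("/prod/ledger/flag", "ledger"),
]

if __name__ == "__main__":
    for line in dedupe_findings(EXAMPLE):
        print(line)
```

Keeping the first occurrence preserves the keyword that matched first; if all matching keywords matter, the evidence fields could be merged instead of dropped.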

22. Dependency Concentration Risk

A category that exposure scanning misses entirely is concentration. An environment can pass every configuration check and still be one failure away from total loss because everything depends on a single thing. One account, one region, one identity provider, one KMS root, one DNS provider, one CI platform. Each is individually defensible, and collectively they form a set of single points of failure where the compromise or outage of any one takes the whole business down. The October 2025 AWS regional outage demonstrated this at industry scale, taking down financial platforms whose architecture assumed a single region would always be available.

The detector produces a critical dependency score by counting how many of the foundational dependencies have no second instance. This is not a finding about a misconfiguration; it is a finding about architecture, and it is exactly the kind of structural risk that boards are equipped to fund and that engineering teams routinely defer because no single check ever flags it.

cat > risk-dependency/dependency-concentration.sh << 'EOF'
#!/usr/bin/env bash
# dependency-concentration.sh - critical_dependency_score: single points of failure
# Requires: aws cli v2, jq

set -euo pipefail

REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
OUTPUT_FILE="findings/dependency-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|dependency|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Dependency Concentration Scan: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"
score=0

# 1. Single region: are production resources confined to one region?
echo "[*] Checking regional concentration..."
active_regions=0
for r in $(aws ec2 describe-regions --query 'Regions[*].RegionName' --output text); do
  # A failed regional call counts as zero instead of aborting the scan under set -e
  count=$(aws ec2 describe-instances --region "$r" \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].InstanceId' --output json 2>/dev/null | jq 'flatten | length' || echo 0)
  [ "${count:-0}" -gt 0 ] 2>/dev/null && active_regions=$((active_regions+1))
done
# Exactly one active region is a concentration; zero means no running EC2 compute was found
if [ "$active_regions" -eq 1 ]; then
  score=$((score+1))
  emit "HIGH" "CONCENTRATION" "region" \
    "All running compute is in a single region; there is no regional failover" \
    "ActiveRegions=${active_regions}" \
    "Regional outage or region-scoped compromise -> entire platform down" \
    "Single regional event is a total outage; demonstrated at scale in October 2025" \
    "5" "0" "999999" \
    "Establish a genuine multi-region footprint for production-critical workloads"
fi

# 2. Single account dependence
echo "[*] Checking account concentration..."
in_org=$(aws organizations describe-organization --query 'Organization.Id' --output text 2>/dev/null || echo "none")
if [ "$in_org" = "none" ]; then
  score=$((score+1))
  emit "HIGH" "CONCENTRATION" "account" \
    "Workloads run in a standalone account with no organisation-level isolation between blast radii" \
    "OrganizationMember=false" \
    "Account compromise -> everything in one blast radius -> no segmentation to contain it" \
    "A single account breach reaches every workload at once" \
    "60" "120" "1440" \
    "Adopt a multi-account organisation that isolates production, security, and recovery"
fi

# 3. Single identity provider (one IAM IdP)
echo "[*] Checking identity provider concentration..."
idp_count=$(aws iam list-saml-providers --query 'SAMLProviderList | length(@)' --output text 2>/dev/null || echo "0")
oidc_count=$(aws iam list-open-id-connect-providers --query 'OpenIDConnectProviderList | length(@)' --output text 2>/dev/null || echo "0")
total_idp=$((idp_count + oidc_count))
# Exactly one provider is a concentration; zero means no federated provider is registered in IAM
if [ "$total_idp" -eq 1 ]; then
  score=$((score+1))
  emit "MEDIUM" "CONCENTRATION" "identity-provider" \
    "Human access depends on a single federated identity provider with no break glass alternative" \
    "FederatedProviders=${total_idp}" \
    "IdP outage or compromise -> nobody can authenticate or everyone is impersonable" \
    "Loss of the IdP locks out all operators or grants an attacker universal access" \
    "30" "240" "480" \
    "Maintain a tightly controlled break glass access path independent of the primary IdP"
fi

# 4. Single KMS root for critical encryption
echo "[*] Checking KMS key custody concentration..."
customer_managed=0
for k in $(aws kms list-keys --query 'Keys[*].KeyId' --output json 2>/dev/null | jq -r '.[]?'); do
  # Treat an unreadable key as unknown instead of aborting the scan under set -e
  mgr=$(aws kms describe-key --key-id "$k" --query 'KeyMetadata.KeyManager' --output text 2>/dev/null || echo "unknown")
  [ "$mgr" = "CUSTOMER" ] && customer_managed=$((customer_managed+1))
done
if [ "$customer_managed" -le 1 ] && [ "$customer_managed" -gt 0 ]; then
  score=$((score+1))
  emit "MEDIUM" "CONCENTRATION" "kms" \
    "A single customer managed key underpins encryption; its loss or deletion is a single point of total data loss" \
    "CustomerManagedKeys=${customer_managed}" \
    "Key deletion or compromise -> all dependent data unreadable at once" \
    "One key is the hinge on which all encrypted data swings" \
    "60" "60" "999999" \
    "Separate keys by data domain and consider multi-Region keys with independent custody"
fi

emit "INFO" "CONCENTRATION_SCORE" "account-level" \
  "critical_dependency_score = ${score} (number of foundational single points of failure)" \
  "Score=${score}" \
  "Each point is a single failure that can take the whole business down" \
  "Higher score means lower architectural resilience" \
  "0" "0" "0" \
  "Drive this score toward zero by introducing redundancy at each foundational layer"

echo "[*] Dependency concentration scan complete. Score=${score}"
EOF
chmod +x risk-dependency/dependency-concentration.sh
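
The scoring idea itself is small enough to state as a pure function. The Python sketch below is an illustration of the counting logic only, with the AWS lookups replaced by plain arguments. It assumes a layer counts as concentrated when exactly one instance of it exists and leaves a count of zero unscored; that treatment of zero, along with the function and argument names, is a choice made for this sketch.

```python
# Minimal sketch: count foundational layers that have no second instance.
# Inputs stand in for the values the shell detector gathers from AWS.


def critical_dependency_score(active_regions, in_organisation,
                              federated_providers, customer_managed_keys):
    """Return (score, concentrated_layers) for the four layers checked."""
    checks = {
        "region": active_regions == 1,
        "account": not in_organisation,
        "identity-provider": federated_providers == 1,
        "kms": customer_managed_keys == 1,
    }
    concentrated = [layer for layer, single in checks.items() if single]
    return len(concentrated), concentrated


if __name__ == "__main__":
    # Hypothetical account: one region, standalone, one IdP, several keys
    score, layers = critical_dependency_score(
        active_regions=1, in_organisation=False,
        federated_providers=1, customer_managed_keys=5,
    )
    print(f"critical_dependency_score = {score}: {', '.join(layers)}")
```

Returning the list of layers alongside the number keeps the score explainable: the board sees the count, and engineering sees which layers produced it.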

23. Company Ending Events

The final family is the one that translates everything above into the single page a board will actually read. It is a set of binary tests, each phrased as a question whose answer is yes or no, and each answer of yes is a SEV1 material risk by definition. Can one action stop customer transactions. Can one identity destroy production. Can one deploy compromise every region. Can one pipeline ship malware. Can one identity delete the evidence. Can one identity disable recovery. Can one action exfiltrate customer data. These are not graded; they are existential, and the only acceptable number of yes answers is zero.

This family deliberately reuses the evaluations from the other families and rolls them up into the catastrophic verdicts. It is the layer that turns a long technical report into the sentence a CIO can take to the board: here are the seven ways this company could end this quarter, here is how many of them are currently possible, and here is what each one would cost in time and money. That sentence is the entire reason for measuring loss pathways rather than exposure.

cat > risk-catastrophic/catastrophic-events.sh << 'EOF'
#!/usr/bin/env bash
# catastrophic-events.sh - Binary company-ending tests; any TRUE is SEV1
# Requires: the other risk families to have run first (reads their findings)

set -euo pipefail

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION="${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}"
OUTPUT_FILE="findings/catastrophic-findings.txt"
mkdir -p findings

emit() {
  echo "$1|$2|${ACCOUNT}|${REGION}|catastrophic|$3|$4|$5|$6|$7|$8|$9|${10}|${11}" | tee -a "$OUTPUT_FILE"
}

echo "=== Catastrophic Event Verdicts: account=${ACCOUNT} ===" | tee "$OUTPUT_FILE"

# Helper: does any prior findings file contain a given risk_type?
finding_exists() {
  grep -hq "|$1|" findings/*.txt 2>/dev/null
}

verdict() {
  local question="$1" risk_type="$2" loss="$3" recommendation="$4"
  if finding_exists "$risk_type"; then
    emit "CRITICAL" "SEV1_MATERIAL_RISK" "catastrophic" \
      "${question} = TRUE" \
      "Backed by at least one ${risk_type} finding" \
      "See ${risk_type} findings for the specific abuse path" \
      "$loss" \
      "varies" "varies" "varies" \
      "$recommendation"
  else
    emit "INFO" "SEV1_CLEARED" "catastrophic" \
      "${question} = FALSE" \
      "No ${risk_type} finding present" \
      "n/a" "No catastrophic exposure detected in this category" \
      "0" "0" "0" \
      "Maintain the controls that keep this answer FALSE"
  fi
}

verdict "Can one identity destroy production"            "DESTRUCTIVE_CAPABILITY" \
  "Total production loss; recovery measured in hours to days or permanent" \
  "Remove destructive permissions from non break glass identities"

verdict "Can one identity disable recovery"              "RECOVERY_DENIAL" \
  "Permanent unrecoverable data loss; ransomware with no clean restore" \
  "Protect backups and keys behind SCP and isolated accounts"

verdict "Can one identity delete the evidence"           "DETECTION_BLINDNESS" \
  "Incident becomes undetectable and unprovable" \
  "Deny detection-suppression actions org-wide and alert on attempts"

verdict "Can one pipeline ship malware"                  "SUPPLY_CHAIN" \
  "Malicious signed artifact delivered to all customers" \
  "Add approval gates and split build from deploy"

verdict "Can one action bankrupt the account"            "DENIAL_OF_WALLET" \
  "Uncapped financial loss up to and beyond \$100k/day" \
  "Apply budgets, quotas, and Bedrock spend guardrails"

verdict "Can one action move money or corrupt records"   "INTEGRITY_ATTACK" \
  "Fraud or data corruption with no availability signal" \
  "Dual control and change alerting on integrity-critical resources"

verdict "Can one identity reach full account control"    "PRIVILEGE_ESCALATION" \
  "Complete account takeover from a single compromised principal" \
  "Eliminate indirect escalation paths, especially PassRole and trust-policy edits"

# Final roll-up
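# grep -c prints 0 but exits non-zero when nothing matches, so do not echo a second "0"
sev1_count=$(grep -c "|SEV1_MATERIAL_RISK|" "$OUTPUT_FILE" 2>/dev/null || true)
sev1_count=${sev1_count:-0}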
echo ""
echo "================================================"
echo "  COMPANY-ENDING EXPOSURE: ${sev1_count} of 7 catastrophic events are currently POSSIBLE"
echo "  The only acceptable number is zero."
echo "================================================"

echo "[*] Catastrophic event scan complete."
EOF
chmod +x risk-catastrophic/catastrophic-events.sh
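
The roll-up can also be expressed in a few lines of Python, which makes the binary nature of the verdicts easy to see. This is a sketch of the same idea and not a replacement for the script: it takes the seven question and risk-type pairs from the verdict calls above, and it matches on the risk-type field specifically, which is slightly stricter than searching for the string anywhere in a line. The sample findings are hypothetical.

```python
# Minimal sketch: roll detector findings up into binary company-ending verdicts.
# Question/risk-type pairs mirror the verdict calls in catastrophic-events.sh.

QUESTIONS = [
    ("Can one identity destroy production", "DESTRUCTIVE_CAPABILITY"),
    ("Can one identity disable recovery", "RECOVERY_DENIAL"),
    ("Can one identity delete the evidence", "DETECTION_BLINDNESS"),
    ("Can one pipeline ship malware", "SUPPLY_CHAIN"),
    ("Can one action bankrupt the account", "DENIAL_OF_WALLET"),
    ("Can one action move money or corrupt records", "INTEGRITY_ATTACK"),
    ("Can one identity reach full account control", "PRIVILEGE_ESCALATION"),
]


def verdicts(lines):
    """Return [(question, is_possible)] based on risk types present in lines."""
    present = set()
    for line in lines:
        parts = line.split("|")
        if len(parts) >= 9:          # skip banners and malformed lines
            present.add(parts[1])    # second field is risk_type
    return [(question, risk_type in present) for question, risk_type in QUESTIONS]


def sev1_count(lines):
    """Number of catastrophic events currently possible."""
    return sum(1 for _, possible in verdicts(lines) if possible)


# Hypothetical findings for illustration only
SAMPLE_LINES = [
    "=== Pipeline Compromise Scan ===",
    "CRITICAL|SUPPLY_CHAIN|123456789012|eu-west-1|supply-chain|ci-role|f|e|a|b|4320|2880|10080|r",
    "HIGH|DENIAL_OF_WALLET|123456789012|eu-west-1|economic|bedrock|f|e|a|b|1440|240|60|r",
]

if __name__ == "__main__":
    for question, possible in verdicts(SAMPLE_LINES):
        print(f"{question} = {'TRUE' if possible else 'FALSE'}")
    print(f"{sev1_count(SAMPLE_LINES)} of {len(QUESTIONS)} currently possible")
```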

24. Scoring and Aggregating All Findings

Not every finding deserves the same response, and the loss pathway model gives us a far better basis for ranking than raw severity. The core formula is that risk equals blast radius multiplied by time. A finding’s base weight comes from its severity, but that weight is then amplified by how long the organisation would remain in the damaged state. A critical finding that recovers in fifteen minutes is serious; a critical finding whose recovery window is effectively permanent, such as a deleted KMS key or an unisolated backup vault, is in a different category entirely, and the scoring must reflect that. The aggregator reads both the nine-field service schema and the fourteen-field loss pathway schema, applies the time multiplier drawn from the recovery estimates, and then leads the report with the company-ending verdicts because those are the page the board reads first.

This is the difference between a tool that produces a number and a tool that produces a decision. The example time-to-containment table that motivated this design makes the point: an admin compromise might be detected in eight minutes, contained in forty-five, and recovered in three hours, whereas a backup deletion might take two hours to detect, twelve to contain, and four days to recover, and a CI compromise might run for three days before anyone notices. The same severity label hides a hundredfold difference in actual harm, and only the time dimension surfaces it.
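
To make the amplification concrete, the Python sketch below applies the severity weights and recovery-time bands used by score.py to the two scenarios above for which a recovery time is given, plus an effectively permanent loss for comparison. Treating all three as CRITICAL is an assumption made for illustration; the scenarios are not assigned a severity in the text.

```python
# Minimal sketch: severity weight amplified by time-to-recover.
# Weights and bands are the ones score.py uses; scenario severities are assumed.

SEVERITY_WEIGHTS = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1, "INFO": 0}

# (minimum recovery minutes, multiplier), checked from longest to shortest
BANDS = [(10080, 4.0), (1440, 3.0), (240, 2.0), (60, 1.5)]


def weighted_score(severity, recover_mins):
    """Base severity weight multiplied by the recovery-time band."""
    multiplier = next((m for floor, m in BANDS if recover_mins >= floor), 1.0)
    return SEVERITY_WEIGHTS[severity] * multiplier


SCENARIOS = [
    ("admin compromise, recovered in three hours", "CRITICAL", 3 * 60),
    ("backup deletion, recovered in four days", "CRITICAL", 4 * 24 * 60),
    ("deleted KMS key, no recovery path", "CRITICAL", 999999),
]

if __name__ == "__main__":
    for label, severity, recover in SCENARIOS:
        print(f"{weighted_score(severity, recover):5.1f}  {label}")
```

Three findings that a severity-only report would list side by side as CRITICAL separate into clearly different weights once recovery time is applied.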

cat > score.py << 'EOF'
#!/usr/bin/env python3
# score.py - Aggregate and score findings as blast radius x time
# Usage: python3 score.py [--findings-dir ./findings] [--output report.txt]
#
# Handles two schemas:
#   service detectors (9 fields):  severity|risk_type|account|region|service|resource|finding|evidence|recommendation
#   loss-pathway families (14):    ...|finding|evidence|abuse_path|business_loss|detect|contain|recover|recommendation

import os
import sys
import glob
import argparse
from collections import defaultdict, Counter
from datetime import datetime

SEVERITY_WEIGHTS = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1, "INFO": 0}

RISK_TYPE_DESCRIPTIONS = {
    "DATA_LOSS": "Permanent or irreversible data loss",
    "SECURITY_EXPOSURE": "Unauthorised access or data exposure",
    "ACCIDENTAL_DELETION": "Infrastructure deleted without recovery path",
    "CRYPTOGRAPHIC_LOSS": "Encryption key loss rendering data unreadable",
    "PRIVILEGE_ESCALATION": "Path to full account control from a lesser identity",
    "PRIVILEGE_ABUSE": "Legitimate credentials capable of catastrophic actions",
    "REDUCED_VISIBILITY": "Impaired ability to detect or investigate incidents",
    "GUARDRAIL_REMOVAL": "Ability to disable the controls meant to contain an attacker",
    "RECOVERY_DENIAL": "Ability to destroy the means of recovery",
    "MACHINE_IDENTITY": "Long-lived or weakly scoped non-human credential",
    "EXTERNAL_TRUST": "Cross-account or federated trust without constraint",
    "DESTRUCTIVE_CAPABILITY": "Provable ability to destroy production",
    "DETECTION_BLINDNESS": "Ability to suppress evidence and operate unseen",
    "IRRECOVERABLE_LOSS": "Compromise leads to a state with no recovery path",
    "ENCRYPTION_HOSTAGE": "Data held hostage by deletable encryption keys",
    "SUPPLY_CHAIN": "Build/deploy pipeline compromise reaches production",
    "DENIAL_OF_WALLET": "Uncapped financial loss through resource abuse",
    "INTEGRITY_ATTACK": "Silent manipulation of money or records, no outage",
    "CONCENTRATION": "Single point of failure for the whole business",
    "CONCENTRATION_SCORE": "Count of foundational single points of failure",
    "SEV1_MATERIAL_RISK": "A company-ending event is currently possible",
    "SEV1_CLEARED": "A company-ending event is not currently possible",
}

# How long an unrecovered state is allowed to persist before it is, for scoring
# purposes, treated as effectively permanent (one week in minutes).
PERMANENT_THRESHOLD_MINS = 10080


def parse_int(value, default=0):
    try:
        return int(float(value))
    except (ValueError, TypeError):
        return default


def load_findings(findings_dir):
    findings = []
    for path in sorted(glob.glob(os.path.join(findings_dir, "*-findings.txt"))):
        with open(path, "r") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith(("===", "[*]")):
                    continue
                parts = line.split("|")
                if len(parts) < 9:
                    continue
                rec = {
                    "severity": parts[0],
                    "risk_type": parts[1],
                    "account": parts[2],
                    "region": parts[3],
                    "service": parts[4],
                    "resource": parts[5],
                    "finding": parts[6],
                    "evidence": parts[7],
                    "abuse_path": "",
                    "business_loss": "",
                    "detect": None,
                    "contain": None,
                    "recover": None,
                    "recommendation": parts[-1],
                    "source_file": os.path.basename(path),
                }
                # 14-field loss-pathway schema carries the extra dimensions
                if len(parts) >= 14:
                    rec["abuse_path"] = parts[8]
                    rec["business_loss"] = parts[9]
                    rec["detect"] = parse_int(parts[10], None)
                    rec["contain"] = parse_int(parts[11], None)
                    rec["recover"] = parse_int(parts[12], None)
                findings.append(rec)
    return findings


def time_multiplier(rec):
    """Risk = blast radius x time. Longer time-to-recover amplifies the score.

    Returns a multiplier in the range 1.0 (fast, fully recoverable) to 4.0
    (effectively permanent loss)."""
    recover = rec.get("recover")
    if recover is None:
        return 1.0
    if recover >= PERMANENT_THRESHOLD_MINS:
        return 4.0
    if recover >= 1440:   # a day or more
        return 3.0
    if recover >= 240:    # four hours or more
        return 2.0
    if recover >= 60:     # an hour or more
        return 1.5
    return 1.0


def score_findings(findings):
    total = 0.0
    for f in findings:
        base = SEVERITY_WEIGHTS.get(f["severity"], 0)
        total += base * time_multiplier(f)
    return round(total)


def grade(score):
    if score == 0:
        return "A - No material findings"
    elif score <= 15:
        return "B - Low material risk"
    elif score <= 45:
        return "C - Moderate material risk"
    elif score <= 90:
        return "D - High material risk"
    else:
        return "F - Critical material risk - immediate action required"


def fmt_minutes(m):
    if m is None:
        return "n/a"
    if m >= PERMANENT_THRESHOLD_MINS:
        return "effectively permanent"
    if m >= 1440:
        return f"~{m // 1440}d"
    if m >= 60:
        return f"~{m // 60}h"
    return f"~{m}m"


def render_report(findings, output_path=None):
    lines = []
    # Label the timestamp with the actual local zone instead of a hard-coded "UTC"
    now = datetime.now().astimezone().strftime("%Y-%m-%d %H:%M %Z")

    lines.append("=" * 80)
    lines.append("  MATERIAL CLOUD RISK REPORT (LOSS PATHWAY MODEL)")
    lines.append(f"  Generated: {now}")
    lines.append("=" * 80)
    lines.append("")

    if not findings:
        lines.append("  No material findings detected.")
        lines.append("")
        report_text = "\n".join(lines)
        _emit(report_text, output_path)
        return

    score = score_findings(findings)
    lines.append(f"  Risk Score (blast radius x time) : {score}")
    lines.append(f"  Grade                            : {grade(score)}")
    lines.append(f"  Findings                         : {len(findings)} total")
    lines.append("")

    # Board-facing summary first: the company-ending verdicts
    sev1 = [f for f in findings if f["risk_type"] == "SEV1_MATERIAL_RISK"]
    cleared = [f for f in findings if f["risk_type"] == "SEV1_CLEARED"]
    if sev1 or cleared:
        lines.append("=" * 80)
        lines.append("  COMPANY-ENDING EVENTS (the page the board reads)")
        lines.append("=" * 80)
        lines.append(f"  {len(sev1)} of {len(sev1) + len(cleared)} catastrophic events are currently POSSIBLE.")
        lines.append("  The only acceptable number is zero.")
        lines.append("")
        for f in sev1:
            lines.append(f"  [SEV1]  {f['finding']}")
            lines.append(f"          Loss: {f['business_loss']}")
        for f in cleared:
            lines.append(f"  [clear] {f['finding']}")
        lines.append("")

    # Then the detailed findings by severity, excluding the rolled-up verdicts
    detail = [f for f in findings if f["risk_type"] not in ("SEV1_MATERIAL_RISK", "SEV1_CLEARED", "CONCENTRATION_SCORE")]
    by_severity = defaultdict(list)
    for f in detail:
        by_severity[f["severity"]].append(f)

    for sev in ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"]:
        bucket = by_severity.get(sev)
        if not bucket:
            continue
        lines.append("-" * 80)
        lines.append(f"  {sev} ({len(bucket)} finding{'s' if len(bucket) != 1 else ''})")
        lines.append("-" * 80)
        for f in bucket:
            lines.append(f"  {f['service'].upper()} | {f['resource']}")
            lines.append(f"  Risk     : {f['risk_type']} - {RISK_TYPE_DESCRIPTIONS.get(f['risk_type'], '')}")
            lines.append(f"  Finding  : {f['finding']}")
            lines.append(f"  Evidence : {f['evidence']}")
            if f["abuse_path"]:
                lines.append(f"  Abuse    : {f['abuse_path']}")
            if f["business_loss"]:
                lines.append(f"  Loss     : {f['business_loss']}")
            if f["recover"] is not None:
                lines.append(f"  Time     : detect {fmt_minutes(f['detect'])} | "
                             f"contain {fmt_minutes(f['contain'])} | recover {fmt_minutes(f['recover'])}")
            lines.append(f"  Action   : {f['recommendation']}")
            lines.append("")

    # Concentration score, if present
    conc = [f for f in findings if f["risk_type"] == "CONCENTRATION_SCORE"]
    for f in conc:
        lines.append("=" * 80)
        lines.append("  ARCHITECTURAL CONCENTRATION")
        lines.append("=" * 80)
        lines.append(f"  {f['finding']}")
        lines.append("")

    # Summary tables
    lines.append("=" * 80)
    lines.append("  SUMMARY BY FAMILY")
    lines.append("=" * 80)
    for service, count in sorted(Counter(f["service"] for f in detail).items(), key=lambda x: -x[1]):
        lines.append(f"  {service.upper():<20} {count}")
    lines.append("")

    lines.append("=" * 80)
    lines.append("  HARDEST TO RECOVER (longest recovery windows)")
    lines.append("=" * 80)
    recoverable = [f for f in detail if f["recover"] is not None and f["recover"] > 0]
    for f in sorted(recoverable, key=lambda x: -x["recover"])[:10]:
        lines.append(f"  {fmt_minutes(f['recover']):<22} {f['service'].upper()} | {f['resource']}")
    lines.append("")

    _emit("\n".join(lines), output_path)


def _emit(report_text, output_path):
    if output_path:
        with open(output_path, "w") as f:
            f.write(report_text)
        print(f"Report written to {output_path}")
    else:
        print(report_text)


def main():
    parser = argparse.ArgumentParser(description="Aggregate and score AWS loss-pathway findings")
    parser.add_argument("--findings-dir", default="./findings")
    parser.add_argument("--output", default=None)
    args = parser.parse_args()

    if not os.path.isdir(args.findings_dir):
        print(f"Error: findings directory not found: {args.findings_dir}", file=sys.stderr)
        sys.exit(1)

    render_report(load_findings(args.findings_dir), output_path=args.output)


if __name__ == "__main__":
    main()
EOF
chmod +x score.py
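
For reference, the shape of a single finding as render_report reads it can be sketched as follows. The keys are the ones score.py indexes above; every value is invented for illustration, and REQUIRED_KEYS and missing_keys are hypothetical helpers that are not part of score.py.

```python
# Illustrative only: the keys are those render_report() reads; all values are made up.
REQUIRED_KEYS = (
    "service", "resource", "severity", "risk_type", "finding", "evidence",
    "abuse_path", "business_loss", "detect", "contain", "recover",
    "recommendation",
)

sample_finding = {
    "service": "backup",
    "resource": "example-vault",        # hypothetical resource name
    "severity": "CRITICAL",             # one of CRITICAL, HIGH, MEDIUM, LOW, INFO
    "risk_type": "EXAMPLE_RISK_TYPE",   # placeholder, not a real detector risk type
    "finding": "Backup vault can be deleted by a workload role",
    "evidence": "example evidence string",
    "abuse_path": "compromised workload role -> delete recovery points",
    "business_loss": "loss of all restorable copies of production data",
    "detect": 30,                       # minutes; None is rendered as n/a
    "contain": 120,                     # minutes
    "recover": 2880,                    # minutes; None suppresses the Time line
    "recommendation": "Apply a vault lock and restrict delete permissions",
}


def missing_keys(finding):
    """Return the required keys absent from a finding dict, in declaration order."""
    return [key for key in REQUIRED_KEYS if key not in finding]


if __name__ == "__main__":
    print(missing_keys(sample_finding))
```

Because render_report indexes these keys directly, a check like missing_keys on each record would turn a malformed detector output into a clear message instead of a KeyError halfway through the report, unless load_findings already normalises records.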

The top-level runner executes all scripts in sequence and produces the final report:

cat > run-all.sh << 'EOF'
#!/usr/bin/env bash
# run-all.sh - Execute all material risk detectors and produce a scored report
# Usage: ./run-all.sh [--region eu-west-1] [--profile myprofile] [--output report.txt]

set -euo pipefail

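# Parse the flags advertised in the usage line (the original read them positionally)
REGION=""
PROFILE=""
OUTPUT="material-risk-report.txt"
while [ $# -gt 0 ]; do
  case "$1" in
    --region)  REGION="${2:?--region requires a value}";   shift 2 ;;
    --profile) PROFILE="${2:?--profile requires a value}"; shift 2 ;;
    --output)  OUTPUT="${2:?--output requires a value}";   shift 2 ;;
    *) echo "Unknown argument: $1" >&2; exit 1 ;;
  esac
done

# Export the profile first so the region lookup below resolves against it
if [ -n "$PROFILE" ]; then export AWS_PROFILE="$PROFILE"; fi
REGION="${REGION:-${AWS_DEFAULT_REGION:-$(aws configure get region 2>/dev/null || echo 'us-east-1')}}"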

export AWS_DEFAULT_REGION="$REGION"

echo ""
echo "======================================================"
echo "  Material Cloud Risk Detector"
echo "  Region  : ${REGION}"
echo "  Account : $(aws sts get-caller-identity --query Account --output text)"
echo "  Started : $(date -u '+%Y-%m-%d %H:%M UTC')"
echo "======================================================"
echo ""

rm -rf findings
mkdir -p findings
mkdir -p risk-service risk-identity risk-destructive risk-observability \
         risk-data risk-supply-chain risk-economic risk-integrity \
         risk-dependency risk-catastrophic
# Current-state service detectors establish what exists
service_scripts=(
  risk-service/ec2-risk.sh
  risk-service/rds-risk.sh
  risk-service/s3-risk.sh
  risk-service/eks-risk.sh
  risk-service/route53-risk.sh
  risk-service/kms-risk.sh
  risk-service/backup-risk.sh
  risk-service/iam-risk.sh
  risk-service/cloudtrail-risk.sh
  risk-service/vpc-risk.sh
  risk-service/secrets-risk.sh
  risk-service/guardduty-risk.sh
  risk-service/organisations-risk.sh
)

# Loss-pathway families establish abuse paths and recovery feasibility
pathway_scripts=(
  risk-identity/identity-blast-radius.sh
  risk-destructive/destructive-capability.sh
  risk-observability/detection-blindness.sh
  risk-data/data-destruction.sh
  risk-supply-chain/pipeline-compromise.sh
  risk-economic/economic-dos.sh
  risk-integrity/integrity-attack.sh
  risk-dependency/dependency-concentration.sh
)

run_script() {
  local script="$1"
  if [ -f "./${script}" ]; then
    echo "[>>] Running ${script}..."
    bash "./${script}" || echo "[!!] ${script} exited with errors - continuing"
    echo ""
  else
    echo "[--] ${script} not found, skipping"
  fi
}

echo "--- Phase 1: current state (service detectors) ---"
for script in "${service_scripts[@]}"; do run_script "$script"; done

echo "--- Phase 2: loss pathways (abuse paths and recovery feasibility) ---"
for script in "${pathway_scripts[@]}"; do run_script "$script"; done

# Catastrophic verdicts must run last because they roll up the other families' findings
echo "--- Phase 3: catastrophic verdicts ---"
run_script "risk-catastrophic/catastrophic-events.sh"

echo "======================================================"
echo "  Scoring and generating report..."
echo "======================================================"
python3 score.py --findings-dir ./findings --output "$OUTPUT"

echo ""
echo "Report written to: ${OUTPUT}"
echo "Completed: $(date -u '+%Y-%m-%d %H:%M UTC')"
EOF
chmod +x run-all.sh

25. From Current State to Recovery Feasibility

Cloud platforms already contain the evidence. The challenge is deciding what matters, and the evolution this post argues for is the move from current state to abuse path to business loss to recovery feasibility. A hygiene dashboard stops at current state and drowns the real risks in lint. An exposure scanner adds a little abuse reasoning but still ranks a misconfigured tag next to a deletable backup vault. A loss pathway detector carries every finding all the way through to the question a board actually asks, which is whether the business can survive the thing the finding describes and how long survival would take.

The eight loss pathway families are where that evolution lives. Identity blast radius treats credential compromise as the primary failure mode it has become. Destructive capability testing proves what an attacker could do rather than guessing from configuration. Detection blindness asks whether they could hide. Data destruction measures minutes to an irrecoverable state instead of confirming that backups are merely enabled. Supply chain measures what a compromised pipeline reaches. Economic denial of service puts a maximum daily burn in front of the CFO. Integrity asks whether money can move with nothing breaking. Dependency concentration counts the single points of failure that no individual check ever flags. And the catastrophic family rolls all of it into seven binary verdicts whose only acceptable answer is no.

Accorian’s cloud audit research notes that even well-architected environments accumulate risk over time in the absence of governance rigour and enforcement, which is the argument for continuous automated detection rather than periodic review. Sysdig’s eight-minute intrusion is the argument for measuring time, because detection and containment windows that were tolerable when attacks took days are fatal when they take minutes. And the October 2025 AWS outage that affected major financial platforms is the argument for measuring concentration, because the environments that recovered fastest were the ones that already knew exactly what they could survive losing and had proven their recovery paths before they needed them.
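
The time argument can be made concrete with a small sketch. It assumes findings carry detect and contain values in minutes, as score.py does, and compares them with the attacker's time to full control. The function name and the sample windows are illustrative assumptions, not measurements from any environment.

```python
def response_gap_minutes(detect, contain, attacker_minutes):
    """Minutes by which detection plus containment lags the attacker.

    A positive result means the attacker reaches their objective before
    containment completes. Returns None when either window is unknown.
    """
    if detect is None or contain is None:
        return None
    return (detect + contain) - attacker_minutes


# Hypothetical response windows: 30 minutes to detect, 60 to contain.
slow_attack = response_gap_minutes(30, 60, attacker_minutes=3 * 24 * 60)  # attack takes days
fast_attack = response_gap_minutes(30, 60, attacker_minutes=8)            # attack takes minutes

print(slow_attack, fast_attack)  # -4230 82
```

The same ninety-minute response that leaves days of margin against a slow attack ends 82 minutes too late against an eight-minute one, which is why the windows belong in the score rather than in a footnote.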

Measure what an attacker can do and how long you would take to recover, not whether a checkbox is ticked. That single shift is what turns cloud hygiene into something boards and CIOs actually fund, because it stops describing doors and starts describing what the company would lose if someone walked through them. The objective is not to prove the environment is clean. The objective is to prove the business can survive, and to know in advance, in minutes and in money, exactly how close to not surviving it currently is.

All scripts require AWS CLI v2 and jq, and score.py also needs Python 3. Run them with a read-only IAM role scoped to the services being scanned. The organisations script additionally requires the organizations:ListPolicies, organizations:DescribeOrganization, and guardduty:ListOrganizationAdminAccounts permissions. All findings are written locally; no data leaves your environment. For further reading on AWS security risk patterns, see the Wiz AWS Security Risks guide, AWS Security Best Practices, and the AWS Well-Architected Security Pillar.