Know Your System: The Operational Cognition Questionnaire

Most outages are not caused by a single bug. They are caused by teams that have gradually lost a coherent mental model of the system they operate. Kubernetes retries. Queues absorb backpressure. Autoscaling masks inefficiency. Managed services hide infrastructure complexity. Over time, teams begin operating systems they no longer fully understand. Then something breaks.

At that moment, the difference between a resilient team and a fragile one becomes visible. Strong teams immediately orient themselves. They know where authentication happens. They know which dependencies are synchronous. They know what state lives in memory versus persistent storage. They know where the logs are, which dashboards matter, which feature flags were recently changed, and which components are safe to restart. Weak teams search for their own architecture diagrams while production burns.

This questionnaire is designed to surface that gap before an outage does. It is not a compliance exercise. It is a test of operational consciousness under pressure. The audit should never be scheduled in advance. The moment preparation becomes possible, teams optimise for theatre instead of truth. Outages do not send calendar invites.

How to Use This Questionnaire

Answer each question from memory. Where evidence is required, produce the artefact live during the session. A correct answer that cannot be demonstrated is not ownership; it is folklore. Evidence markers are called out explicitly throughout each section.

1. System Topology

Can the team describe the actual runtime architecture without consulting diagrams?

Live exercise: Whiteboard the full request path for a customer transaction from entry point to database. Do not reference existing diagrams. Compare the result to production reality.

Evidence required: The completed whiteboard versus an actual architecture diagram or service mesh topology export. Discrepancies are findings.

  1. What is the complete request path for a customer transaction?
  2. Which components in that path are synchronous and which are asynchronous?
  3. Which systems must be available for a user login to succeed?
  4. Which systems are optional but degrade the experience if unavailable?
  5. Which components hold state, and what kind of state?
  6. Which components can be safely restarted without customer impact?
  7. Which services communicate over HTTP, gRPC, internal messaging, or direct database calls?
  8. Which dependencies cross region or availability zone boundaries?
  9. Which dependencies are internet reachable?
  10. Which services are hardest to scale under sudden load?
  11. What happens to in-flight transactions if the primary load balancer fails?
  12. Which part of the topology has changed most recently?

Failure indicators: Team argues over traffic flow. Engineers confuse logical architecture with deployment architecture. Nobody can name where state lives. Dependency chains are incomplete. The whiteboard bears no resemblance to production.
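
The whiteboard-versus-production comparison can be reduced to a set difference over call edges. The following is a minimal sketch, assuming both the whiteboard and the topology export can be flattened into (caller, callee) pairs. The service names and the `diff_topology` helper are hypothetical and exist only for illustration.

```python
from typing import NamedTuple

Edge = tuple[str, str]  # (caller, callee)


class TopologyDiff(NamedTuple):
    missing_from_whiteboard: set[Edge]  # real edges the team did not draw
    only_on_whiteboard: set[Edge]       # drawn edges that do not exist in production


def diff_topology(whiteboard: set[Edge], production: set[Edge]) -> TopologyDiff:
    """Compare the team's drawn call graph with the observed one."""
    return TopologyDiff(
        missing_from_whiteboard=production - whiteboard,
        only_on_whiteboard=whiteboard - production,
    )


# Hypothetical service names, for illustration only.
whiteboard = {("gateway", "auth"), ("gateway", "orders"), ("orders", "orders-db")}
production = {
    ("gateway", "auth"),
    ("gateway", "orders"),
    ("orders", "orders-db"),
    ("orders", "fraud-check"),  # a synchronous call nobody drew
}

findings = diff_topology(whiteboard, production)
for caller, callee in sorted(findings.missing_from_whiteboard):
    print(f"finding: {caller} -> {callee} exists in production but was not drawn")
```

Every edge in either set of the result is a finding in the sense used above: one direction shows forgotten dependencies, the other shows beliefs about the system that are no longer true.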

2. DNS and Certificate Management

Does the team understand how name resolution and trust chain validity underpin every connection?

Live exercise: Identify every external and internal DNS name the system depends on. List certificate expiry dates for all public and internal TLS endpoints from memory, then verify against reality.

Evidence required: A current certificate inventory with expiry dates. Demonstrate a live check against at least one critical endpoint. If no inventory exists, that is a finding.

  1. Which DNS names are critical to the system’s operation?
  2. Who controls the DNS zones those names live in?
  3. What is the TTL for your primary service DNS records?
  4. What happens to the system if internal DNS resolution fails?
  5. What happens if an external DNS provider has an outage?
  6. Are there any DNS records pointing to decommissioned infrastructure?
  7. Which TLS certificates are in use across public endpoints?
  8. Which TLS certificates are in use on internal service-to-service connections?
  9. When does the next certificate expire?
  10. Who owns certificate renewal, and is the process automated?
  11. What is the alerting threshold before a certificate expiry triggers a notification?
  12. What happens if a certificate expires on a Saturday morning?
  13. Where are certificates stored and managed: in a secrets manager, a certificate manager service, on disk, or somewhere else entirely?
  14. How is the full rotation process performed for a public-facing certificate, from generating a new certificate to retiring the old one, and has anyone done it recently?
  15. How do you test that a certificate rotation succeeded without causing a production outage?
  16. Have wildcard certificates been issued, and where are they used?
  17. Is certificate pinning used anywhere, and what is the rotation plan?
  18. Which systems perform certificate revocation checking, and do they fail open or fail closed?

Failure indicators: No certificate inventory exists. Renewal is manual and undocumented. Nobody knows when the next expiry is. Internal service certificates are untracked. Certificates live on individual servers with no central record. Nobody has performed a rotation recently enough to remember the steps. Alerts fire at 30 days but the renewal process takes 45.
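
A live expiry check needs nothing beyond the Python standard library. The sketch below is one possible approach, not a prescribed tool: `fetch_not_after` reads the `notAfter` field from an endpoint's leaf certificate, `days_until_expiry` converts it to days remaining, and `alert_fires_too_late` encodes the last failure indicator above. All three function names are assumptions made for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter value.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jun  1 12:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a verified TLS connection and return the leaf certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def alert_fires_too_late(alert_days: int, renewal_lead_days: int) -> bool:
    """True when the alert leaves less time than the renewal process needs."""
    return alert_days < renewal_lead_days
```

During the session, something like `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` can be run against a critical endpoint taken from the inventory. Because the default context verifies the chain, an already expired or untrusted certificate raises an error instead of returning a date, which is itself a useful signal.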

3. Authentication, Identity, and Account Lockout

Does the team understand trust boundaries, token lifecycle, and what happens when authentication pressure or failure propagates through the system?

Live exercise: Trace a complete authentication event from credential submission to session establishment. Then describe what happens when that same user submits incorrect credentials five times in succession.

Evidence required: Document or diagram showing token signing configuration, session storage location, and the account lockout policy with threshold values. Demonstrate where lockout state is stored and how it is cleared.

  1. Where does authentication terminate in the stack?
  2. Where is token validation performed, and by which components?
  3. Which systems trust JWTs directly without further validation?
  4. What signing algorithm is used, and what is the key rotation schedule?
  5. How are signing keys rotated without dropping active sessions?
  6. What happens if the identity provider becomes unavailable?
  7. What happens if token introspection latency increases from 20ms to 4 seconds?
  8. Where are sessions stored?
  9. What is the session expiry policy?
  10. How are service-to-service identities managed?
  11. Which systems can impersonate a customer?
  12. Which systems enforce authorization as distinct from authentication?
  13. What is the account lockout threshold for failed authentication attempts?
  14. Where is lockout state stored?
  15. What is the lockout duration, and is it fixed or progressive?
  16. How is a legitimate user unlocked: automatically, by support staff, or through a self-service flow?
  17. Can a lockout be triggered remotely by an attacker with knowledge of a username?
  18. Is there rate limiting on authentication endpoints independent of account lockout?
  19. What happens if the lockout state store becomes unavailable?
  20. Are administrative and service accounts subject to the same lockout policy as customer accounts?

Failure indicators: Teams confuse authentication with authorization. Token lifecycle is unclear. Lockout thresholds are unknown or inconsistently applied. Lockout state has no fallback behaviour. Service accounts have no lockout policy at all.
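
The lockout questions above become concrete once the policy is written down as code. This is a minimal sketch of a progressive lockout, assuming a threshold of five failures, a 60-second first lockout that doubles on each repeat, and a one-hour cap. Those values, the class names, and the in-memory store are all assumptions for illustration, not a recommended policy.

```python
from dataclasses import dataclass, field


@dataclass
class LockoutPolicy:
    threshold: int = 5       # failed attempts before the account locks (assumed value)
    base_seconds: int = 60   # duration of the first lockout (assumed value)
    max_seconds: int = 3600  # cap on progressive growth (assumed value)


@dataclass
class AccountState:
    failures: int = 0
    lockouts: int = 0
    locked_until: float = 0.0


@dataclass
class LockoutTracker:
    policy: LockoutPolicy = field(default_factory=LockoutPolicy)
    # In-memory store for illustration; where this lives in a real system is question 14.
    state: dict[str, AccountState] = field(default_factory=dict)

    def is_locked(self, user: str, now: float) -> bool:
        return now < self.state.get(user, AccountState()).locked_until

    def record_failure(self, user: str, now: float) -> None:
        account = self.state.setdefault(user, AccountState())
        if now < account.locked_until:
            return  # attempts made during a lockout do not extend it in this sketch
        account.failures += 1
        if account.failures >= self.policy.threshold:
            duration = min(
                self.policy.base_seconds * 2 ** account.lockouts,
                self.policy.max_seconds,
            )
            account.locked_until = now + duration
            account.lockouts += 1
            account.failures = 0

    def record_success(self, user: str) -> None:
        # A successful login clears all lockout state for the account.
        self.state.pop(user, None)
```

Note that this sketch locks purely on username, so anyone who knows a username can lock that account out. That is exactly the condition question 17 asks about, and it is why rate limiting independent of lockout matters.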

4. Data Flow and Persistence

Does the team understand how data moves, where it survives, and what happens when writes fail mid-flight?

Scenario: Your primary database becomes read-only for 20 minutes. What continues working? What silently fails? What corrupts?

Evidence required: Identify which queues or topics exist and demonstrate visibility into their current depth and consumer lag. Show the dead letter queue configuration for at least one critical flow.

  1. Which database is the source of truth for customer account state?
  2. Which data is eventually consistent, and which is strongly consistent?
  3. Which writes are transactional, and which are fire-and-forget?
  4. What happens if the message queue or event broker becomes unavailable?
  5. What is the replay strategy for failed events?
  6. How is idempotency enforced on consumers?
  7. Which systems cache data, and how stale can that cache become?
  8. Which systems perform distributed writes across more than one data store?
  9. How are dead letters handled, and who monitors the dead letter queue?
  10. Which systems are vulnerable to duplicate message processing?
  11. What happens during a partial commit failure across a distributed transaction?
  12. How is schema evolution managed, and who approves breaking changes?

Failure indicators: Teams assume queues are infinite. Nobody understands replay semantics. Consistency guarantees are vague. Dead letter queues are not monitored. Schema changes happen without migration plans.
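
Questions 6 and 10 are easier to answer with a reference shape in mind. The sketch below shows one common way to enforce idempotency on a consumer: remember which message IDs have been applied and skip repeats. It assumes each message carries an ID that is unique per logical event; the `Message` and `IdempotentConsumer` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    message_id: str  # assumed to be unique per logical event
    account: str
    amount_cents: int


@dataclass
class IdempotentConsumer:
    balances: dict[str, int] = field(default_factory=dict)
    processed_ids: set[str] = field(default_factory=set)

    def handle(self, msg: Message) -> bool:
        """Apply the message once; return False if it was a duplicate delivery."""
        if msg.message_id in self.processed_ids:
            return False
        self.balances[msg.account] = self.balances.get(msg.account, 0) + msg.amount_cents
        self.processed_ids.add(msg.message_id)
        return True
```

In this in-memory form the state change and the deduplication record cannot diverge. In a real consumer they must be committed atomically, for example in the same database transaction, otherwise a crash between the two steps reintroduces the duplicate the check was meant to prevent.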

5. Observability

Can the team see reality, or are they navigating by assumption?

Live exercise: Inject a synthetic latency spike or error rate increase into a non-production environment. Observe how the team diagnoses it. Do they reach for logs, traces, metrics, or guesswork first?

Evidence required: Open the primary operational dashboard live. Identify which panels represent genuine health signals versus decorative metrics. Show a complete trace for a customer transaction. Demonstrate a log query that correlates a single request across all services.

  1. Where are application logs stored, and what is the retention period?
  2. Which correlation ID ties a single transaction across all services?
  3. Can you trace a single customer request from entry to database and back?
  4. Which metrics define whether the system is healthy right now?
  5. Which alerts are actionable versus persistently noisy?
  6. What is your single most important operational dashboard?
  7. Which alert indicates total system failure?
  8. Which failure modes generate no alert at all?
  9. Which services have incomplete or absent telemetry?
  10. How long would it take to identify the origin of a latency increase affecting 5% of requests?
  11. Can the team identify a failing downstream dependency in under five minutes?
  12. Are logs structured or unstructured, and can they be queried by customer identifier?
  13. Is distributed tracing implemented end to end, or only within individual services?
  14. Are business events such as transactions submitted, payments processed, and accounts created observable separately from infrastructure events?
  15. When did a dashboard last mislead the team during an incident?

Failure indicators: Logs are searched randomly without a query strategy. Dashboards display metrics nobody acts on. Alerts exist without a named owner. Distributed traces stop at service boundaries. Business events are invisible to operations.
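
A log query can only correlate a request across services if every log line carries the same identifier. The following is a minimal sketch of that idea using the standard `logging` and `contextvars` modules. The field names in the JSON output and the `start_request` helper are assumptions for this example; real systems usually take the ID from an incoming request header.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the request currently being handled.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "service": record.name,
                "message": record.getMessage(),
                "correlation_id": correlation_id.get(),
            }
        )


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID when present; otherwise mint a new one at the edge."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid
```

Attaching `JsonFormatter` to a handler makes every line emitted while handling a request queryable by `correlation_id`. The important design choice is in `start_request`: a service that mints a fresh ID instead of reusing the incoming one is where the trail stops at a service boundary.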

6. Failure Mode Understanding

Does the team understand how their system dies?

Challenge: Describe a realistic path to a complete outage starting from a single component failure. If the answer is “that cannot happen,” the audit has already failed.

Evidence required: A written or whiteboarded failure mode map identifying the three most dangerous single points of failure and their blast radius. If none exists, produce it during the session.

  1. What is the most likely outage scenario in the next 90 days?
  2. What is the most dangerous silent failure, the one that corrupts data or loses events without alerting?
  3. Which single dependency creates the largest blast radius if it degrades?
  4. What causes cascading retry storms in this system?
  5. What happens during connection pool exhaustion?
  6. What happens during thread starvation in a high-concurrency service?
  7. What happens during disk pressure on a database node?
  8. Which components fail open, and which fail closed?
  9. Which failures can corrupt customer data?
  10. Which failures are irreversible without a restore from backup?
  11. What operational condition worries the team most right now?
  12. What near-miss occurred in the last six months that was never formally reviewed?

Failure indicators: The team cannot name a realistic outage path. Silent failures are unacknowledged. Blast radius analysis has never been done. Near-misses produced no documented learning.
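
Question 4 has a simple arithmetic core: when every layer in a call chain retries independently, the attempts multiply. The sketch below shows that worst case alongside capped exponential backoff with full jitter, a widely used mitigation. The function names and default values are assumptions chosen for illustration.

```python
import random
from typing import Optional


def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest dependency for one user request
    when every layer makes up to `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers


def backoff_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Capped exponential backoff with full jitter.

    The delay before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)], so clients that failed together
    do not retry together.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


# Three attempts at each of four layers: one request can become 81 calls.
print(worst_case_attempts(3, 4))
```

Backoff spreads retries out in time but does not reduce the multiplication across layers. That usually requires retrying at only one layer of the chain, or bounding the total number of retries each service is allowed to issue.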

7. Recovery Capability

Can the team restore the system, or does recovery depend on heroics?

Live exercise: Name the steps to restore the system from a complete failure of the primary database. Identify who performs each step, where the credentials are, and how long each step takes.

Evidence required: A runbook for at least one critical recovery procedure. Evidence that a restore from backup has been performed within the last 90 days. Access logs showing who currently holds production credentials.

  1. How long would full recovery actually take, measured from detection to full service restoration?
  2. Who has production access right now?
  3. Can the system be rebuilt entirely from source and configuration, without manual steps?
  4. Where are secrets stored, and who has access to the secrets manager?
  5. How are backups verified as restorable?
  6. When was a production restore last tested?
  7. Which systems require manual intervention during recovery, and is that intervention documented?
  8. Which systems are undocumented and owned only in someone’s memory?
  9. What happens if the senior engineer is unreachable during a major incident?
  10. Which recovery steps carry risk of causing further damage if performed incorrectly?
  11. Which recovery steps are irreversible?
  12. What does the on-call engineer do in the first five minutes of a P1?

Failure indicators: DR exists only in presentations. No restore has been tested recently. Production access is unclear. Recovery depends on one person. Runbooks describe what to do without describing how.

8. Security Posture

Does the team understand the attack surface they are operating?

Evidence required: A current list of internet-exposed endpoints with associated authentication requirements. Evidence of the most recent penetration test finding and its remediation status.

  1. Which endpoints are internet reachable without authentication?
  2. Which endpoints accept unauthenticated traffic for legitimate reasons?
  3. Where are secrets stored, and when were they last rotated?
  4. Which services run with elevated privileges, and why?
  5. How is inbound traffic validated or sanitised at the entry point?
  6. Which dependencies have known vulnerabilities in the current version deployed?
  7. What is the process for rotating a compromised credential in production?
  8. Which systems log authentication failures, and where do those logs go?
  9. Is there network segmentation between customer-facing services and internal systems?
  10. What does a successful credential stuffing attack look like in your telemetry?

Failure indicators: Internet-exposed endpoints are not fully enumerated. Secrets are stored in environment variables or source repositories. Elevated privileges are unexplained. Compromised credential rotation has no documented procedure.

9. Operational Ownership

Does the team behave like owners, or like maintainers of someone else’s system?

Evidence required: Named owners for uptime, latency SLOs, cost, and security posture. Evidence that SLO metrics are reviewed on a regular cadence. A list of incidents from the last quarter with documented post-incident learning.

  1. Who owns uptime, and who is accountable when the SLO is breached?
  2. Who owns latency, and who acts when p99 degrades?
  3. Who owns infrastructure cost, and who responds when spend spikes?
  4. Who owns security posture for this system?
  5. Who communicates with customers during an outage?
  6. Who approves a risky release?
  7. Who decides rollback criteria before a release goes out?
  8. Which metrics are reviewed on a fixed weekly or fortnightly cadence?
  9. Which incidents in the last quarter produced documented learning and changed behaviour?
  10. Which incidents produced no change?
  11. What operational debt exists today and what is the remediation plan?
  12. What part of the system worries the team most, and what is being done about it?

Failure indicators: Ownership diffuses across teams with no single accountable person. Dependencies are blamed reflexively. Post-incident reviews are written and filed without producing changed behaviour. Operational debt is acknowledged but unscheduled.

10. Voice Under Pressure

Does everyone who understands the system feel safe enough to say so during an incident?

Technical knowledge is only operationally useful if it can be spoken aloud at the moment it is needed. This is not a soft skills observation. It is an operational risk. A team can contain the person who knows exactly what is wrong while that person stays silent because a senior engineer is already committed to a wrong hypothesis, because the incident bridge has twenty people on it and speaking up feels impossible, or because past incidents punished the person who was right but contradicted someone more senior. The knowledge existed. The outage continued anyway.

This section audits whether the conditions for speaking up actually exist on the team, not whether people claim they do.

Live exercise: During a simulated incident or post-incident review, ask the most junior engineer present to contradict the most senior. Observe what happens. If it never occurs naturally, that is the finding.

Evidence required: Post-incident timelines identifying when the correct diagnosis was first raised and by whom. If the correct hypothesis was raised late in the incident by someone junior, ask why it took that long to surface.

  1. In the last major incident, who first identified the root cause?
  2. How long after they identified it did they say it out loud?
  3. Was there a period during the incident where someone knew something relevant but did not speak?
  4. How does the team handle a situation where a junior engineer believes the incident commander has the wrong hypothesis?
  5. Has anyone ever stayed silent during an incident and later turned out to have known the answer?
  6. What does the incident bridge sound like when things are going badly: is it one voice, or many?
  7. Is it safe to say “I think we are looking at the wrong thing” to the most senior person in the call?
  8. Are post-incident reviews psychologically safe enough for people to say they were confused or did not understand the system at a critical moment?
  9. Does the team have an explicit norm for how to interrupt or redirect during a live incident?
  10. Have any engineers left the team whose departure removed critical system knowledge that nobody else held?

Failure indicators: Root cause was known before it was spoken. Junior engineers consistently defer rather than contribute. Incident calls are dominated by one or two voices. Post-incident reviews attribute delay to tooling or process rather than communication failure. The team describes itself as having good psychological safety but cannot name a recent example of someone successfully contradicting a senior engineer under pressure.

11. Fractured Ownership and Organisational Failure

Does the team structure reflect the system’s actual complexity, or has the system been fragmented across teams in a way that guarantees nobody understands the whole?

When ten or more teams appear on an incident call, the architecture has already failed. That is not a collaboration model. It is evidence that the system has been divided beyond the point where any individual team can reason about end-to-end behaviour, and that the organisation has mistaken narrow vertical ownership for engineering rigour. What follows is predictable: every team can demonstrate their own component is functioning correctly while the customer experience is completely broken, and the incident becomes a political negotiation about fault rather than a technical exercise in diagnosis and recovery.

Fractured ownership is rarely the result of malicious decisions. It accumulates through years of Conway’s Law operating without correction. Teams are formed around delivery capability rather than operational responsibility. Microservices proliferate beyond the point of coherence. Platform and product boundaries multiply. Nobody ever draws the line and says this system now has more owners than any one person can hold in their head, and that is a problem we must fix. By the time the outage arrives, fifteen teams each own a slice so thin that none of them can be held accountable for the whole, and all of them have a coherent argument for why the fault lies elsewhere.

This section asks the questions that surface that condition before the next incident exposes it.

Evidence required: The incident bridge attendance log from the last three major incidents. If more than five teams were represented, ask why. A current list of named owners for the top ten customer-impacting user journeys, not services, but journeys. If that list does not exist or cannot be produced in five minutes, it is a finding.

  1. How many teams were on the last major incident call, and what did each of them contribute?
  2. If the answer to question 1 is more than five, what does that tell you about how ownership is structured?
  3. Can a single team describe the complete path of the most critical customer journey end to end, including every system that touches it?
  4. Which team is accountable when a customer journey fails and the fault spans more than one service boundary?
  5. Has any incident in the last year concluded with no team accepting ownership of the root cause?
  6. How many teams need to approve a change that affects end-to-end latency for the primary customer flow?
  7. Which team would you wake up at 3am for a complete customer-facing outage? If the answer is unclear or contested, that is the finding.
  8. Are there services in production today that are owned by a team that no longer exists or has been reorganised away?
  9. How many handoffs does a customer request cross before it completes, and does any single team understand all of them?
  10. When the last post-incident review assigned action items, how many different teams received them, and how many of those items were completed?
  11. Is the current team structure a deliberate design decision, or is it the accumulated result of reorganisations that nobody has revisited?
  12. If you were to redesign team ownership from scratch around operational accountability rather than delivery velocity, what would change?

Failure indicators: More than five teams on a routine incident call. No single team can describe a complete customer journey. Ownership of cross-service failures is genuinely unclear rather than temporarily disputed. Post-incident action items are distributed across so many teams that nobody follows up. Teams describe their scope as “our service” rather than “our customer outcome.” The organisation has optimised for independent deployment at the cost of coherent operational ownership. Architects and engineering managers cannot agree on which team is responsible for end-to-end reliability.

Score each section from 0 to 5.

Score  Meaning
0      No operational understanding
1      Fragmented tribal knowledge held by one or two individuals
2      Partial awareness that degrades under pressure
3      Functional ownership: the team can operate the system
4      Strong operational cognition: the team can reason about failure
5      Deep systems mastery: the team can anticipate failure before it occurs

A score of 3 or above across all sections is the minimum threshold for a team operating a production system that holds customer trust.
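
The threshold applies to every section individually, so a strong average does not compensate for one weak section. A minimal sketch of that rule, with hypothetical section names and scores:

```python
MINIMUM_SCORE = 3  # the threshold stated above


def sections_below_threshold(scores: dict[str, int]) -> list[str]:
    """Return the sections scoring under the minimum, in the order given."""
    for section, score in scores.items():
        if not 0 <= score <= 5:
            raise ValueError(f"{section}: score {score} is outside the 0 to 5 scale")
    return [section for section, score in scores.items() if score < MINIMUM_SCORE]


def meets_minimum(scores: dict[str, int]) -> bool:
    """True only when every section scores 3 or above."""
    return not sections_below_threshold(scores)


# Hypothetical results for three of the sections.
scores = {"System Topology": 5, "Observability": 5, "Recovery Capability": 2}
print(meets_minimum(scores))             # the average is 4.0, yet the team fails
print(sections_below_threshold(scores))  # the sections that need work first
```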

Final Observation

The most dangerous systems are not the ones with bugs, because every system has bugs. The dangerous systems are the ones where operators no longer possess a coherent mental model of what they are running. Those teams often mistake deployment frequency for engineering maturity, while slowly accumulating invisible operational illiteracy beneath the surface.

A team that truly knows its system can explain how it works, how it scales, how it fails, how it recovers, and how customers experience degradation before telemetry even confirms it. That level of understanding is not documentation. It is operational consciousness.