The Quiet Power of Free Tier: Why Cloudflare Gets It Right

By Andrew Baker, CIO at Capitec Bank

There is a truth that most technology vendors either do not understand or choose to ignore: the best sales pitch you will ever make is letting someone use your product for free. Not a watered-down demo, not a 14-day trial that expires before anyone has figured out the interface, but a genuinely generous free tier that lets people build real things and solve real problems. Cloudflare understands this better than almost anyone in the industry right now, and it has made me a genuine advocate in a way that no amount of marketing spend ever could.

1. How I Found Cloudflare and Almost Lost It

My journey with Cloudflare did not begin with enthusiasm. It began at Capitec, where I was evaluating infrastructure and security platforms at institutional scale. My initial view of Cloudflare was limited. I saw a CDN with an API gateway capability: useful, but not architecturally differentiated in any meaningful way from competing options. My awareness of what genuinely set it apart was low.

The concerns I had at that stage were squarely enterprise concerns. The lack of private peering between Cloudflare and AWS in South Africa was a meaningful issue for Capitec specifically. For a major retail bank operating in this market, network latency, peering, and routing are not abstract considerations. They are hard requirements. The absence of a direct peering arrangement had me questioning whether Cloudflare could credibly serve the needs of a bank with millions of active customers.

Then came a series of outages in 2025. Any one of those incidents in isolation might have been forgivable, but cumulatively they put Cloudflare in a difficult position. For a platform whose core value proposition is reliability and availability, sustained turbulence shakes confidence.

What changed my perspective was not a sales conversation or an analyst briefing. It was personal experimentation. I started using Cloudflare for andrewbaker.ninja, my personal blog, after joining Capitec. That hands-on use opened up a completely different view of the platform. What I had evaluated as a CDN with an API gateway was actually something far more capable. I discovered R2, Cloudflare’s object storage offering. I worked through Workers in depth. I started building real functionality at the edge, not just routing traffic through it. Most significantly, our team began using Cloudflare Workers to create custom malware signals and block traffic based on behavioural patterns, turning what I had thought of as a passive network layer into an active security enforcement point.
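To make the idea concrete, here is a minimal sketch of what behavioural blocking in a Worker can look like. The signal names, weights, threshold, and probe paths below are my own illustrative choices for this sketch, not the signals our team actually ships and not a Cloudflare API:

```javascript
// Illustrative behavioural scoring inside a Cloudflare Worker.
// All weights and the blocking threshold are hypothetical examples.
function scoreRequest({ userAgent, path, method }) {
  let score = 0;
  if (!userAgent) score += 40; // missing User-Agent: strong bot signal
  if (/curl|python-requests|scrapy/i.test(userAgent || "")) score += 30; // scripted clients
  if (/\/wp-login\.php|\/xmlrpc\.php/.test(path)) score += 30; // common WordPress probe paths
  if (method === "POST" && path.startsWith("/wp-comments-post")) score += 20; // comment spam
  return score;
}

// In a deployed Worker this object would be the module's default export.
const worker = {
  async fetch(request) {
    const url = new URL(request.url);
    const score = scoreRequest({
      userAgent: request.headers.get("user-agent"),
      path: url.pathname,
      method: request.method,
    });
    if (score >= 50) {
      return new Response("Forbidden", { status: 403 }); // block at the edge
    }
    return fetch(request); // below threshold: pass through to the origin
  },
};
```

The point is not the specific rules, which any WAF could approximate, but that the enforcement logic is ordinary code you can version, test, and evolve as attack patterns change.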

That is the moment the evaluation changed. The peering concerns and the stability questions remained live issues, but I now had genuine product depth that allowed me to weigh them against a much clearer picture of Cloudflare’s architectural differentiation. That picture came entirely from free tier experimentation on a personal blog. It could not have come from a sales deck.

2. What Cloudflare Actually Gives You for Free

The Cloudflare free tier is, frankly, extraordinary. When I first started using it for andrewbaker.ninja, I expected the usual pattern: enough capability to see the shape of the product, but with enough gates and limits to push you toward a paid plan. What I found instead was a comprehensive platform that covers almost every dimension of modern web security and performance at zero cost.

2.1 Security and Performance at the Edge

The foundation of the free tier is unmetered DDoS mitigation. Not capped, not throttled after a threshold, unmetered. For a personal blog or small business site, volumetric attacks are existential threats, and the fact that Cloudflare absorbs them at no cost is a remarkable statement of confidence in their own network scale. Sitting on top of that is a global CDN spanning over 300 cities, with free tier users on the same edge infrastructure as enterprise customers. SSL is automated, free, and renews without any manual intervention, making the secure default the effortless default. Five managed WAF rules covering the most critical OWASP categories are included, along with basic bot protection that handles the constant noise floor of scrapers, credential stuffers, and scanning bots that any public site attracts.

Caching deserves particular attention because for anyone running on a low-end AWS instance type, and most personal blogs do exactly that, it is not a nice-to-have. It is life or death for the origin server. A t3.micro or t4g.small running WordPress has a hard ceiling. Under normal traffic patterns it holds up, but a post that gains momentum on LinkedIn or gets picked up by a newsletter will generate concurrent requests that a small instance simply cannot absorb. With Cloudflare serving the majority of that traffic from cache, the origin barely notices the spike. I have watched this play out against andrewbaker.ninja more than once. The cache hit ratio in the analytics dashboard tells the story clearly: the origin handles a fraction of total requests while Cloudflare absorbs the rest. That is an availability and cost story simultaneously. Cache rules, custom TTLs, per-URL purging, and intelligent handling of query strings and cookies are all available on the free tier, giving you a degree of control that is not normally associated with a free offering.
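The arithmetic behind that claim is worth spelling out. The traffic numbers here are hypothetical, chosen only to illustrate the scale of the effect:

```javascript
// Illustrative only: how many requests actually reach the origin
// at a given cache hit ratio during a traffic spike.
function originRequests(totalRequests, cacheHitRatio) {
  return Math.round(totalRequests * (1 - cacheHitRatio));
}

// A hypothetical spike of 50,000 requests at a 95% cache hit ratio
// leaves the origin serving roughly 2,500 of them, a load a small
// instance can handle comfortably even though the raw spike would
// have taken it down.
```

The cache hit ratio is therefore the single number that converts a capacity problem into a non-event, which is why the free tier exposing it in analytics matters.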

2.2 Developer Capability and Operational Visibility

Beyond security and performance, the free tier extends into territory that genuinely surprises. Workers gives you serverless compute at the edge with 100,000 requests per day included, which is more than enough to build meaningful functionality: request transformation, custom authentication flows, A/B testing, and API proxying. In our case, it became a platform for building custom malware detection signals and traffic blocking logic that goes well beyond what a conventional WAF configuration could achieve. Cloudflare Pages adds free static site hosting with unlimited bandwidth and up to 500 builds per month, competitive with the best JAMstack platforms. DNS management sits on infrastructure widely regarded as the fastest authoritative DNS in the world, with DNSSEC and a clean management interface included at no cost.
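As one concrete illustration of what fits comfortably inside that daily request budget, here is a hedged sketch of edge A/B assignment. The cookie name, the 50/50 split, and the variant path prefix are my own illustrative choices, not part of any Cloudflare API:

```javascript
// Illustrative edge A/B split: read an existing variant cookie,
// or assign one at random on first visit. The cookie name and
// split ratio are hypothetical choices for this sketch.
function chooseVariant(cookieHeader) {
  const match = /ab-variant=(a|b)/.exec(cookieHeader || "");
  return match ? match[1] : (Math.random() < 0.5 ? "a" : "b");
}

// Inside a Worker's fetch handler, the chosen variant would select
// the origin path, and the cookie would be set on the response so
// each visitor stays in one bucket across requests.
function variantPath(pathname, variant) {
  return variant === "b" ? "/variant-b" + pathname : pathname;
}
```

A dozen lines at the edge replaces what would otherwise be client-side scripts or origin-side routing logic, and it runs identically for every visitor regardless of geography.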

The analytics layer is where Cloudflare makes a particularly interesting choice. Rather than gating visibility behind paid plans to obscure the value being delivered, the free tier shows you everything: requests, bandwidth, cache hit ratios, threats blocked by type, geographic traffic distribution, and real user Web Vitals data including Largest Contentful Paint and Cumulative Layout Shift from actual visitor sessions. For andrewbaker.ninja, the geographic breakdown alone was genuinely new information that shaped content decisions. Seeing threats blocked in real time makes the protection layer concrete rather than theoretical. Zero Trust Access rounds out the free offering with up to 50 users, giving hands-on experience with a ZTNA model that enterprise vendors charge significant per-user premiums to access.

One area where I would encourage Cloudflare to go further is 404 error tracking, which currently sits behind paid plans. A limited version tracking errors for just a handful of pages would cost them very little while giving free tier users a direct experience of the capability. The broader principle I would advocate is that every service in the Cloudflare catalogue should have at least a small free window. Exposure drives understanding, understanding drives advocacy, and advocacy drives enterprise pipeline far more reliably than any campaign.

3. The Strategic Value of Free Tier as a Leadership Development Tool

Let me be direct about what actually happened here. Cloudflare was already on my radar at Capitec, evaluated cautiously and with real reservations. What the free tier did was deepen my product knowledge far beyond what any enterprise evaluation process produces. I moved from understanding Cloudflare as a CDN with an API gateway to understanding it as a programmable edge platform with genuine security enforcement capability. That shift happened entirely through personal experimentation, at zero cost to Cloudflare beyond the infrastructure they were already running.

No sales team call produced that outcome. No analyst briefing, no conference sponsorship, no whitepaper. A free tier account for a personal blog did.

This is not a coincidence or a lucky edge case. It is the mechanism by which free tier compounds in value over time in ways that are almost impossible to model but entirely real. The person experimenting with your product on a side project today is accumulating product knowledge that travels with them across every context in which they operate, personal and professional simultaneously. When that person holds senior leadership responsibility, the intuitions built through free tier experimentation inform how they frame requirements, assess vendor claims, and evaluate architectural trade-offs. Crucially, that knowledge also provides resilience when a platform goes through a difficult period. I stayed with Cloudflare through the 2025 stability issues not because of a reassuring account manager call but because my own hands-on depth gave me enough architectural confidence to make an informed judgment rather than a reactive one.

The same pattern holds with AWS. My understanding of AWS architecture was built significantly through free tier experimentation. The 12 months of free tier access that AWS provides across a substantial catalogue of services is one of the smartest investments they have made in their developer ecosystem. My seven AWS certifications represent formal validation of knowledge that was built largely through hands-on experimentation the free tier enabled. When I evaluate AWS proposals at Capitec or advocate for specific AWS architectural patterns, that credibility traces back to free tier experience. No marketing budget produces that outcome.

Free tier products are, in effect, a leadership development programme that technology vendors run at their own expense. Every future CIO, CTO, or technology decision maker working their way up through an organisation is building instincts and preferences right now through the products they can access and experiment with freely. The vendors who understand this invest in those experiences. The vendors who do not are optimising for short-term revenue extraction at the cost of long-term pipeline development.

4. The Slack Cautionary Tale

Slack represents the opposite lesson, and it is worth examining honestly.

I used Slack’s free tier heavily for years. Across multiple communities, interest groups, and peer networks, Slack was the default platform precisely because the free tier was generous enough to make it viable for groups that could not or would not pay. It was through this extensive free tier use that I developed deep familiarity with the product, its integrations, its workflow automation capabilities, and its organisational model. That familiarity translated directly into Slack advocacy in enterprise contexts.

Then came a series of changes to the free tier. Message history limits became more restrictive. Integration constraints tightened. The experience of being a free tier user shifted from feeling like a valued participant in the platform ecosystem to feeling like someone being actively nudged toward payment.

The result was not that the communities I participated in upgraded to paid Slack. The result was that those communities moved to other platforms. Discord absorbed many of them. Some moved to Microsoft Teams. Others fragmented across different tools. In most cases the community did not reconstitute on Slack at a paid tier. It simply left.

The downstream consequence for Salesforce, which acquired Slack for approximately 27.7 billion dollars, is a meaningful erosion of exactly the pipeline that free tier usage was building. Every community organiser, technology professional, and business leader who built their Slack intuitions through free tier usage and then migrated to an alternative platform is now building comparable depth of knowledge on a competing product. The future enterprise purchasing decisions of those individuals will reflect that. Slack did not just lose free tier users. It cut off future sales pipeline development at the roots.

This is a cautionary tale that should sit prominently in the strategic planning conversations of any technology company considering changes to their free tier offering. The immediate revenue signal from restricting free tier is misleading. The long-term signal, which is harder to measure and slower to manifest, is the erosion of informed advocacy and the diversion of future decision makers toward alternatives.

5. Rethinking the Marketing Mix

I hold a view that is probably uncomfortable for most marketing organisations: technology companies should meaningfully reduce marketing spend in favour of free tier investment.

I understand why this is a hard argument to make internally. Marketing spend produces attributable metrics. Pipeline influenced, leads generated, impressions delivered. Free tier investment produces outcomes that are diffuse, long horizon, and resistant to attribution. The CIO who advocates for your platform in a 2028 procurement decision because they built something meaningful with your free tier in 2024 is almost impossible to trace back to that original free tier investment in any marketing analytics framework.

But the influence is real and it is durable in a way that no campaign achieves. You can say anything you want about a product through marketing. You can claim reliability, performance, security posture, developer experience, and operational simplicity until every available channel is saturated. None of it carries the weight of having used the product yourself, watched it perform under real conditions, seen it recover from real failures, and built genuine intuition about its architectural strengths and constraints.

There is also a fundamental misunderstanding embedded in how many enterprise technology vendors think about who actually buys their products. Most enterprise software is not bought by lawyers or sourcing teams. It is bought by engineers. Sourcing teams negotiate contracts and lawyers review them, but the decision about which platform gets shortlisted, which architecture gets proposed to leadership, and which vendor gets championed internally is made by the technical people who will live with the choice. Those people make their recommendations based on product knowledge, hands-on experience, and the intuition that comes from having actually built something with the technology. Embedding that knowledge in the market is not a nice-to-have. It is the primary sales motion, whether vendors recognise it or not. Every engineer who has meaningful free tier experience with your product is a potential internal champion in a future procurement cycle. Every engineer who has never touched your product, because the access gate was too high, is not.

Cloudflare has clearly internalised this. Their free tier is not a reluctant concession to market norms. It is a deliberate investment in developing the next generation of platform advocates. The breadth of capability they make available at no cost, spanning network security, edge compute, DNS, analytics, and Zero Trust access, reflects a confidence that the product will demonstrate its own value to the people who use it. That confidence is justified. It worked on me, though not in the way a typical marketing funnel would predict or model.

6. Conclusion

Free tier products close the distance between description and experience. They are the most honest form of marketing because they are not marketing at all. They are just the product, made accessible.

For Cloudflare, the free tier fundamentally changed how I understand the platform. I came in seeing a CDN with an API gateway. Personal experimentation with Workers, R2, and custom edge security logic revealed an architecture that is genuinely differentiated. The enterprise concerns around peering and the 2025 stability issues remained real, but the product depth I had built through free tier use meant those concerns could be weighed against a much clearer picture of what Cloudflare actually is at a platform level. That is a completely different evaluation from the one I would have made without it.

For Slack, the contraction of free tier generosity has had the opposite effect, redirecting communities and the professional development of their members toward competing platforms in ways that will compound as career trajectories advance.

The lesson is straightforward even if the organisational will to act on it is not. Invest in free tiers. Invest generously. The future pipeline you are building is less visible than the one your sales team can point to today, but it is deeper, more durable, and ultimately more valuable. Let people experience your product. Trust that it is good enough to speak for itself. If it is not, that is the more important problem to solve.


Andrew Baker is the Chief Information Officer at Capitec Bank in South Africa. He writes about enterprise architecture, cloud infrastructure, banking technology, and leadership at andrewbaker.ninja.

Business Heads: Technology Leadership Competence Assessment

This is an assessment. It is not balanced. It is not here to validate your instincts, your planning methodology, or your confidence in the delivery framework you inherited. It exists to surface how you actually think about technology leadership when you are deciding whether to trust an engineer, approve a pivot, or override a technical warning to protect a timeline.

Answer honestly. Not as the executive you present in interviews. As the leader you become when the deadline is real, the team is pushing back, and someone senior is asking you for certainty you do not have.

Every option is phrased to sound reasonable, responsible, and professionally defensible. That is the point. The wrong answers are rarely stupid. They are comfortable.

How to Score Yourself

🟢 Strong technology leadership instinct – demonstrates systems thinking, quality, sustainability, and genuine respect for engineering as a discipline
🟡 Acceptable but surface-level – not wrong, but reveals a preference for process, optics, or a management lens over a technology leadership lens
🔴 Concerning – reveals a fixation on timelines, revenue, reporting ceremony, or a belief that technologists are execution resources who should deliver rather than think

After answering all questions, count how many 🟢, 🟡, and 🔴 answers you selected. Then read the interpretation at the end.

Questions:

1. A Major Platform Decision Was Approved Six Months Ago

New evidence suggests it may be the wrong choice. What do you do?

A. Revisit the decision with the new evidence and recommend a course correction even if it causes short term disruption
B. Flag the concern but continue execution since the committee already approved it and reversing would delay the programme
C. Raise it informally but keep delivery on track since the timeline commitments to the board cannot slip
D. Continue as planned because reopening approved decisions undermines confidence in the governance process

2. Your Team Proposes Removing an Integration Layer

It will reduce complexity but invalidate three months of another team’s work. How do you proceed?

A. Protect the other team’s work and find a compromise that keeps both approaches since we need to respect the investment already made
B. Evaluate the simplification on its technical merits regardless of sunk cost and proceed if the outcome is better for customers
C. Delay the decision until next quarter’s planning cycle so it can be properly socialised across all stakeholders
D. Proceed only if the simplification can be shown to accelerate the current delivery timeline

3. You Inherit Seven Management Layers Between CTO and Engineers

What is your first instinct?

A. Understand why each layer exists and remove any that do not directly contribute to decision quality or delivery outcomes
B. Add a dedicated delivery management function to coordinate across the layers more effectively
C. Maintain the structure but introduce better reporting dashboards so you can see through the layers
D. Restructure the layers around revenue streams so each layer has clear commercial accountability

4. What Is the Primary Purpose of a Technology Strategy Document?

A. To secure budget approval by demonstrating alignment between technology investments and projected revenue growth
B. To reduce uncertainty by clarifying what the organisation will and will not build, and why
C. To provide a roadmap with delivery dates that the business can hold the technology team accountable to
D. To communicate the technology vision to non technical stakeholders in a way they find compelling

5. What Does Blast Radius Mean in Systems Architecture?

A. The scope of impact when a single component fails, and how far the failure propagates across dependent systems
B. The amount of data lost during a disaster recovery event before backups can be restored
C. The total number of customers affected during a planned maintenance window
D. The financial exposure created by a system outage, measured in lost revenue per minute

6. When Designing a Critical System, What Is Your Primary Architectural Concern?

A. Ensuring the system can scale to meet projected revenue targets for the next three years
B. Designing for graceful failure so the system degrades safely rather than failing catastrophically
C. Selecting the vendor with the strongest enterprise support agreement and SLA guarantees
D. Ensuring the architecture aligns with the approved enterprise reference model and standards

7. What Does It Mean to Design a System Assuming Breach Will Happen?

A. Building layered defences, monitoring, and containment so that when a breach occurs the damage is limited and detected quickly
B. Purchasing comprehensive cyber insurance to cover the financial impact of a breach event
C. Conducting annual penetration tests and remediating all critical findings before the next audit cycle
D. Ensuring all systems are compliant with the relevant regulatory frameworks and industry standards

8. A Project Is Behind Schedule

The team suggests reducing scope to meet the deadline. The business stakeholder wants the full scope delivered on time. What do you recommend?

A. Deliver the reduced scope with high quality and iterate, since shipping broken software on time is worse than shipping less software that works
B. Add additional resources to accelerate delivery since the business committed to the date with external partners
C. Negotiate a two week extension with the full scope since the revenue impact of a delayed launch is manageable
D. Split the team to deliver the core features on time and the remaining features two weeks later as a fast follow

9. How Should Work Ideally Flow Through a Well Functioning Technology Team?

A. Through two week sprints with defined ceremonies, backlog grooming, sprint reviews, and retrospectives
B. Through continuous small changes deployed frequently with clear ownership and minimal handoffs
C. Through quarterly planning cycles with monthly milestone reviews and weekly status reporting
D. Through a prioritised backlog managed by a product owner who coordinates with the business on delivery sequencing

10. A Team Is Delivering Features on Time but Production Incidents Are Increasing

What does this tell you?

A. The team is likely cutting corners on quality to meet deadlines and the delivery metric is masking a growing technical debt problem
B. The team needs better production support tooling and a dedicated site reliability function
C. The team is delivering well but the infrastructure team is not scaling the platform to match the increased feature throughput
D. The incident management process needs improvement since faster triage would reduce the apparent incident volume

11. What Is the Difference Between Vertical Scaling and Horizontal Scaling?

A. Vertical scaling adds more power to a single machine while horizontal scaling adds more machines to distribute the load
B. Vertical scaling increases storage capacity while horizontal scaling increases network bandwidth
C. Vertical scaling is for databases and horizontal scaling is for application servers
D. Vertical scaling is cheaper at small volumes while horizontal scaling is cheaper at large volumes, which is why you choose based on cost projections

12. What Is Technical Debt?

A. Shortcuts or suboptimal decisions in code and architecture that make future changes harder, slower, or riskier
B. The accumulated cost of software licences and infrastructure that the organisation is contractually committed to paying
C. The gap between the current technology stack and the approved target state architecture
D. Legacy systems that have not yet been migrated to the cloud as part of the digital transformation programme

13. Why Is It Important That a System Can Be Observed in Production?

A. Because without visibility into how the system behaves under real conditions you cannot diagnose problems, understand performance, or detect failures early
B. Because the compliance team requires evidence that systems are being monitored as part of the annual audit
C. Because the business needs real time dashboards showing transaction volumes and revenue metrics
D. Because the vendor SLA requires the organisation to demonstrate monitoring capability to qualify for support credits

14. What Is the Primary Benefit of a Public Cloud Provider Like AWS or Azure?

A. The ability to provision and scale infrastructure on demand without managing physical hardware, paying only for what you use
B. Guaranteed lower costs compared to on premises infrastructure for all workload types and volumes
C. Automatic compliance with all regulatory requirements since the cloud provider manages the security controls
D. Eliminating the need for a technology team since the cloud provider manages everything end to end

15. What Is the Shared Responsibility Model in Cloud Computing?

A. The cloud provider is responsible for the security of the cloud infrastructure while the customer is responsible for securing what they build and run on it
B. The cloud provider and the customer share the cost of infrastructure equally based on a negotiated commercial agreement
C. Both the cloud provider and the customer have equal responsibility for all aspects of security and neither can delegate
D. The cloud provider assumes full responsibility for everything deployed on their platform as part of the service agreement

16. What Is an Availability Zone?

A. A physically separate data centre within a cloud region, designed so that failures in one zone do not affect others
B. A geographic region where the cloud provider offers services, such as Europe West or US East
C. A virtual network boundary that isolates different customer workloads from each other for security purposes
D. A pricing tier that determines the level of uptime guarantee and support response time for your workloads

17. What Is Infrastructure as Code?

A. Defining and managing cloud infrastructure through machine readable configuration files that can be version controlled and reviewed like software
B. A software tool that automatically generates infrastructure diagrams from the live cloud environment
C. A methodology for documenting infrastructure decisions in a shared wiki so the team can track changes over time
D. An approach where infrastructure costs are coded into the project budget as a separate line item from application development

18. When Should Testing Happen in the Development Lifecycle?

A. Continuously throughout development, with automated tests running on every code change as part of the build pipeline
B. After development is complete, during a dedicated testing phase before the release is approved for production
C. At key milestones defined in the project plan, with formal sign off required before moving to the next phase
D. Primarily before major releases, with exploratory testing conducted by the QA team in the staging environment

19. A Team Tells You They Have 95% Code Coverage

How confident should you be in their quality?

A. Coverage alone does not indicate quality because tests can cover code without meaningfully validating behaviour or edge cases
B. Very confident since 95% coverage means almost all of the codebase has been validated by automated tests
C. Moderately confident but you would want to see the coverage broken down by module to check for gaps in critical areas
D. You would need to compare the coverage metric against the industry benchmark for their technology stack to assess it properly

20. What Is the Purpose of a Chaos Engineering or Game Day Exercise?

A. To deliberately introduce failures into a system to test how it responds and to build confidence that recovery mechanisms work
B. To simulate peak traffic scenarios to verify the infrastructure can handle projected load during high revenue periods
C. To test the disaster recovery plan by failing over to the secondary site and measuring recovery time against the SLA
D. To stress test the team’s incident management process and identify bottlenecks in the escalation procedures

21. What Is the Difference Between a Data Warehouse and a Data Lake?

A. A data warehouse stores structured, curated data optimised for querying and reporting, while a data lake stores raw data in its native format for flexible future use
B. A data warehouse is an on premises solution while a data lake is a cloud native service that replaces the need for traditional databases
C. A data warehouse is owned by the business intelligence team while a data lake is owned by the engineering team, which is why they are governed separately
D. A data warehouse handles historical data for compliance purposes while a data lake handles real time data for operational dashboards

22. Your Organisation Wants to Build a Machine Learning Model to Predict Customer Churn

What is the first question you should ask?

A. Do we have clean, representative data that captures the behaviours and signals that precede churn, and do we understand the biases in that data
B. What is the expected revenue impact of reducing churn by a target percentage, and does it justify the investment in a data science team
C. Which vendor platform offers the best prebuilt churn prediction model so we can deploy quickly without building a team from scratch
D. Can we have a working model within the current quarter so we can demonstrate the value of AI to the executive committee

23. What Is the Biggest Risk of Deploying a Machine Learning Model Without Ongoing Monitoring?

A. The model will silently degrade as real world data drifts away from the data it was trained on, producing increasingly wrong predictions that nobody notices until damage is done
B. The model will consume increasing amounts of compute resources over time, driving up infrastructure costs beyond the original budget
C. The compliance team may flag the model as a risk because it was deployed without a formal model governance review and sign off process
D. The business will lose confidence in AI if the model produces a visible error, which could jeopardise funding for future AI initiatives

24. A Business Stakeholder Wants an AI Feature That Automates a Customer Decision

The team warns that the training data contains historical bias. What do you do?

A. Take the bias concern seriously. Deploying a biased model at scale will amplify discrimination, create regulatory exposure, and damage customer trust in ways that are extremely difficult to undo
B. Proceed with the deployment but add a disclaimer that the model’s recommendations should be reviewed by a human before any final decision is made
C. Ask the data science team to quantify the bias impact and present a risk assessment to the steering committee so leadership can make an informed commercial decision
D. Deprioritise the concern for now and launch the feature since the competitive advantage of being first to market outweighs the risk, and the bias can be addressed in a future iteration

25. You Have One AI Engineer Embedded in a Feature Team

Nobody in the team or its management chain has AI or machine learning experience. The engineer’s work is reviewed by people who do not understand it. How do you evaluate this structure?

A. This is a problem. The engineer has no peers to learn from, no manager who can grow their career, and no quality gate on their work. They will either stagnate, produce unchallenged work of unknown quality, or leave. AI engineers need to sit in or be connected to a community of practice with people who understand their discipline
B. This is fine as long as the engineer has clear deliverables and the feature team has a strong product owner who can validate the business outcomes of the AI work
C. This is efficient. Embedding specialists directly in feature teams ensures their work is aligned with delivery priorities and avoids the overhead of a separate AI team that operates disconnected from the product
D. This is manageable. Provide the engineer with access to external training and conferences so they can maintain their skills, and ensure their performance is measured on delivery milestones like any other team member

26. What Does Data Governance Mean in Practice?

A. Ensuring the organisation knows what data it has, where it lives, who owns it, how it flows, what quality it is in, and what rules govern its use, so that data is treated as a product rather than an accident
B. A framework of policies and committees that approve data access requests and ensure all data usage complies with the relevant regulatory requirements
C. A set of data classification standards and retention policies that are documented and audited annually to satisfy regulatory obligations
D. A technology platform that enforces role based access controls and encrypts data at rest and in transit across all systems

27. You Need to Hire a Senior Engineer

Which quality matters most?

A. Deep curiosity, the ability to reason through unfamiliar problems, and a track record of simplifying complex systems
B. Certifications in the specific technologies your team currently uses, with at least ten years of experience in the industry
C. Strong communication skills and experience presenting to executive stakeholders and steering committees
D. A proven ability to deliver projects on time and within budget, with references from previous programme managers

28. An Engineer Pushes Back on a Technical Decision You Made

They provide evidence you were wrong. What is the ideal response?

A. Thank them, evaluate the evidence, and change the decision if the evidence warrants it because being right matters more than being in charge
B. Acknowledge their input and ask them to document their concerns formally so they can be reviewed in the next architecture review board
C. Listen carefully but explain the broader strategic context they may not be aware of that influenced your original decision
D. Appreciate the initiative but remind them that decisions at your level factor in commercial and timeline considerations beyond the technical merits

29. What Is the Biggest Risk When a Non Technical Leader Runs a Technology Team?

A. They cannot distinguish between genuine technical risk and comfortable excuses, which leads to either missed danger or wasted time
B. They tend to over rely on vendor solutions and consultancies because they cannot evaluate build versus buy decisions independently
C. They struggle to earn the respect of senior engineers, which leads to talent attrition and difficulty recruiting strong replacements
D. They focus on timelines and deliverables rather than the technical foundations that determine whether those deliverables are sustainable

30. A Vendor Promises to Solve a Critical Problem

What is your first concern?

A. Whether the solution creates a dependency that will be expensive or impossible to exit, and what happens when the vendor changes direction
B. Whether the vendor is on the approved procurement list and whether the commercial terms fit within the current budget cycle
C. Whether the vendor has case studies from similar organisations and what their Net Promoter Score is among existing customers
D. Whether the vendor can commit to a delivery timeline that aligns with the programme milestones already communicated to the board

31. You Are Reviewing Two Architecture Proposals

Proposal A is clever and impressive but requires deep expertise to operate. Proposal B is simpler but less elegant. Which do you prefer?

A. Proposal B, because a system that can be understood, operated, and maintained by the team that inherits it is more valuable than one that impresses today
B. Proposal A, because the additional complexity is justified if it delivers significantly better performance metrics
C. Neither until both proposals include detailed cost projections and a total cost of ownership comparison over five years
D. Whichever proposal the lead architect recommends since they have the deepest technical context on the constraints

32. A 97 Slide Strategy Deck Is Presented to You

What is your reaction?

A. Scepticism, because length often compensates for lack of clarity and a strong strategy should be explainable in a few pages
B. Appreciation, because a thorough strategy deck shows the team has done their due diligence and considered all angles
C. Request an executive summary of no more than five slides that highlights the key investment asks and expected returns
D. Review it in detail because strategic decisions of this magnitude deserve comprehensive analysis and supporting evidence

33. A Technology Team Has No Weekly Status Report

They deploy daily, incidents are low, and customers are satisfied. Is this a problem?

A. No. Outcomes are the evidence. If the system works, customers are happy, and the team ships reliably, the absence of a status report means nothing is being hidden
B. Yes. Without a structured weekly report the leadership team has no visibility into what the team is doing and cannot govern effectively
C. It depends. A lightweight status update would be beneficial for alignment even if things are going well, since stakeholders deserve visibility
D. Yes. Consistent reporting is a professional discipline. Even high performing teams need to document their progress for accountability and audit purposes

34. A Team Discovers Halfway Through a Migration That the Original Plan Was Wrong

They adjust and complete the migration successfully but two weeks later than planned. How do you evaluate this?

A. Positively. Learning while doing is an inherent property of complex work. The team adapted to reality and delivered a successful outcome, which is exactly what good engineering looks like
B. As a planning failure. The incorrect assumptions should have been identified during the planning phase. A proper discovery exercise would have prevented the overrun
C. Neutrally. The outcome was acceptable but the team should produce a lessons learned document to prevent similar planning gaps in future projects
D. As a risk management issue. The two week overrun needs to be logged and the planning process needs to include more rigorous assumption validation before execution begins

35. You Ask a Technology Lead How a Project Is Going

They say they do not know yet because the team is still working through some unknowns. How do you respond?

A. Appreciate the honesty. Not knowing is a valid state early in complex work. Ask what they are doing to reduce the unknowns and when they expect to have a clearer picture
B. Ask them to prepare a risk register and preliminary timeline estimate within two days so you have something to report upward
C. Express concern. A technology lead should always be able to articulate the status of their work, even if uncertain, and should present options with probability weightings
D. Escalate the concern. If the lead cannot provide a clear status update, the project may lack adequate governance and oversight

36. What Is the Most Important Thing to Measure About a Technology Team’s Performance?

A. The business outcomes their work enables, including reliability, customer experience, and the ability to change safely
B. Velocity and throughput, measured by story points completed per sprint across all teams
C. Time to market for new features, measured from business request to production deployment
D. Budget adherence, measured by comparing actual technology spend against the approved annual plan

37. A Senior Architect Strongly Disagrees With Your Proposed Approach

They present an alternative in a team meeting. They are blunt and direct. How do you handle this?

A. Welcome it. Blunt disagreement backed by evidence is a sign of a healthy team. Evaluate the alternative on its merits and decide based on what produces the best outcome
B. Thank them for their perspective but ask them to raise concerns through the proper channels rather than challenging your direction in a group setting
C. Acknowledge their passion but remind the team that once a direction is set, the expectation is to commit and execute rather than relitigate decisions
D. Listen but note that architectural decisions need to factor in business timelines and stakeholder commitments, not just technical preferences

38. How Do You View the Role of Engineers in Decision Making?

A. Engineers are domain experts whose knowledge should be actively extracted, challenged, and synthesised into better decisions. The best outcomes come from iterative collaboration, not instruction
B. Engineers should provide technical input and recommendations, but the final decision authority rests with the business leader who owns the commercial outcome
C. Engineers should focus on execution excellence. They are most effective when given clear requirements and the autonomy to choose the implementation approach
D. Engineers should be consulted on technical feasibility, but strategic decisions about what to build and when should be driven by the product and business teams

39. Your Best Engineers Have Stopped Voicing Opinions in Meetings

What does this tell you?

A. Something is wrong. When strong engineers go quiet, it usually means they have concluded that their input does not matter, which means the organisation is about to lose them or already has in spirit
B. They may be focused on delivery. Not every engineer wants to participate in strategic discussions and some prefer to let their code speak for itself
C. It could indicate that the team has matured and aligned around a shared direction, which reduces the need for debate
D. It suggests the decision making process is working efficiently. Fewer objections means the planning and communication have improved

40. An Engineer Tells You the Proposed Deadline Is Unrealistic

The team will either miss it or ship something that breaks. What do you do?

A. Take the warning seriously. Engineers who raise alarms about deadlines are usually right and ignoring them is how organisations end up with production failures and burnt out teams
B. Acknowledge the concern and ask them to propose an alternative timeline with a clear breakdown of what can be delivered by when
C. Thank them for the flag but explain that the deadline was set based on commercial commitments and the team needs to find a way to make it work
D. Ask them to quantify the risk. If they can show specific technical evidence for why the deadline is unrealistic, you will escalate it. Otherwise the plan stands


Answer Key With Explanations

Each option is scored 🟢, 🟡, or 🔴, and the explanation focuses on what that option optimises for over time.

1. A Major Platform Decision Was Approved Six Months Ago

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Prioritises the right outcome over protecting past decisions | Better products and fewer sunk costs
B | 🟡 | Honouring governance feels responsible | Delivery of the wrong thing, on time
C | 🟡 | Protecting board timelines is professionally safe | Informal concerns that go nowhere
D | 🔴 | Governance confidence is genuinely valuable | Entrenched wrong decisions and learned helplessness

2. Your Team Proposes Removing an Integration Layer

Option | Score | Why it is attractive | What it tends to create
A | 🟡 | Respecting investment sounds fair | Sunk cost paralysis masquerading as empathy
B | 🟢 | Merits and customer outcomes as the deciding lens | Better systems and cleaner architecture
C | 🟡 | Socialisation reduces friction | Delay that allows the right call to be avoided indefinitely
D | 🔴 | Timeline acceleration is always a defensible frame | Technology decisions subordinated to scheduling

3. You Inherit Seven Management Layers

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Cutting what adds no value is the only honest response | Faster decisions and cleaner accountability
B | 🔴 | Coordination feels like the problem | More layers solving the symptoms of layers
C | 🟡 | Dashboards feel safe and non disruptive | Visibility into a structure that still doesn’t work
D | 🔴 | Commercial accountability sounds modern | Revenue framing over delivery quality

4. What Is the Primary Purpose of a Technology Strategy Document?

Option | Score | Why it is attractive | What it tends to create
A | 🔴 | Budget alignment is how things get funded | Strategy in service of approval rather than clarity
B | 🟢 | Clarity over what you will and will not build is rare and powerful | Fewer wasted investments and better decisions
C | 🟡 | Accountability sounds mature | Accountability for the wrong things if the strategy is wrong
D | 🟡 | Communicating vision is legitimate | Style over substance if the audience cannot push back

5. What Does Blast Radius Mean?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct definition with systems thinking built in | Better architectural decisions and safer design
B | 🟡 | Data loss is a real concern | Conflates backup and resilience concepts
C | 🟡 | Customer impact is the right concern | Misses cascading failure as the core concept
D | 🔴 | Financial framing is relatable to business heads | Revenue lens applied to an engineering concept

6. When Designing a Critical System, What Is Your Primary Architectural Concern?

Option | Score | Why it is attractive | What it tends to create
A | 🟡 | Revenue targets are a real design constraint | Optimises for scale at the expense of resilience
B | 🟢 | Graceful failure is the most durable design principle | Systems that fail safely rather than catastrophically
C | 🟡 | Vendor SLAs feel like insurance | Outsources architectural thinking to contracts
D | 🟡 | Reference models reduce reinvention | Compliance over fitness

7. What Does Designing Assuming Breach Mean?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Layered defence and containment is the correct instinct | Systems that limit damage when breaches happen
B | 🟡 | Insurance feels like risk management | Financial mitigation without technical defence
C | 🟡 | Penetration testing is a real practice | Annual exercises are not the same as assume breach design
D | 🟡 | Compliance feels like security | Compliance theatre that passes audits and fails breaches

8. A Project Is Behind Schedule

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Quality over date is the harder but more durable choice | Systems that work and users that trust them
B | 🔴 | External commitments feel binding | More people working on a broken plan faster
C | 🟡 | Extension with full scope sounds balanced | May be right if revenue calculation is honest
D | 🟡 | Splitting delivery sounds pragmatic | Can create integration debt if the fast follow never arrives

9. How Should Work Flow Through a Technology Team?

Option | Score | Why it is attractive | What it tends to create
A | 🟡 | Agile ceremonies are familiar and teachable | Process compliance rather than actual agility
B | 🟢 | Continuous flow and minimal handoffs are what actually work | Fast learning and high quality delivery
C | 🔴 | Quarterly cycles sound like proper governance | Planning theatre that misses reality by a quarter
D | 🟡 | Product owner coordination feels organised | Backlogs that grow rather than systems that improve

10. Features Are on Time but Incidents Are Increasing

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Delivery masking quality debt is the most common failure pattern | Early intervention before the system breaks loudly
B | 🟡 | Tooling gaps are real | Treats a symptom without asking what caused it
C | 🟡 | Infrastructure scaling is a genuine bottleneck | Deflects from delivery quality as the root cause
D | 🔴 | Process improvement sounds constructive | Reduces apparent incidents without reducing actual ones

11. Vertical Versus Horizontal Scaling

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct and precise | Ability to make informed infrastructure decisions
B | 🟡 | Storage and bandwidth are real dimensions | Fundamentally wrong definition
C | 🟡 | Database versus app server is a familiar split | Oversimplification that breaks in practice
D | 🔴 | Cost framing is relatable | Reduces a technical question to a finance question

12. What Is Technical Debt?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct definition that connects to consequences | Ability to have honest conversations about investment
B | 🟡 | Licence and infrastructure costs feel like debt | Confuses financial obligations with technical constraints
C | 🟡 | Target state framing is familiar from transformation programmes | Reduces debt to a migration backlog
D | 🟡 | Legacy systems are a common mental model | Misses the fact that new systems accumulate debt too

13. Why Does Observability in Production Matter?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct and operationally grounded | Engineers who can diagnose and improve systems
B | 🟡 | Compliance evidence is a real requirement | Monitoring as audit artefact rather than operational tool
C | 🟡 | Business dashboards are a legitimate need | Confuses business reporting with system observability
D | 🔴 | SLA qualification sounds like a practical reason | Observability in service of vendor contracts, not operations
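The distinction behind option A can be sketched in code. The example below is a hypothetical illustration, not a real monitoring stack: a decorator emits one structured event per request (handler name, outcome, latency), which is the kind of signal an engineer queries to diagnose a system, as opposed to a business dashboard. All names (`instrumented`, `lookup`, `events`) are invented for the sketch.

```python
import json
import time

events = []  # stands in for a real log or metrics pipeline

def instrumented(handler):
    """Emit a structured event per call: enough to ask 'what is slow, and why'."""
    def wrapper(request):
        start = time.perf_counter()
        status = "ok"
        try:
            return handler(request)
        except Exception:
            status = "error"
            raise
        finally:
            events.append(json.dumps({
                "handler": handler.__name__,
                "status": status,
                "latency_ms": round((time.perf_counter() - start) * 1000, 3),
            }))
    return wrapper

@instrumented
def lookup(request):
    # Stands in for a real request handler.
    return {"account": request, "balance": 1000}

lookup("acc-1")
print(events[-1])  # one structured, queryable event per request
```

The point of the sketch is the shape of the data: per-request, structured, and tied to system behaviour, which is what makes option A operational rather than reportorial.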

14. The Primary Benefit of Public Cloud

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | On demand provisioning and elastic cost is the real value | Infrastructure that scales with reality
B | 🟡 | Cost reduction is often part of the pitch | False certainty that ignores workload specifics
C | 🔴 | Compliance automation sounds appealing | Dangerous misunderstanding of shared responsibility
D | 🔴 | Elimination of overhead sounds efficient | Cloud adoption without understanding what you still own

15. The Shared Responsibility Model

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct and precise | Security decisions made with accurate mental models
B | 🟡 | Commercial framing is relatable | Confuses security responsibility with cost sharing
C | 🟡 | Shared accountability sounds balanced | Removes the clarity that makes the model useful
D | 🔴 | Full provider responsibility sounds like the deal | Organisations that discover their responsibilities too late

16. What Is an Availability Zone?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct and operationally precise | Architecture that plans for and survives zone failures
B | 🟡 | Regions are a real cloud concept | Conflates region with zone
C | 🟡 | Network isolation is a related cloud concept | Confuses network boundaries with physical redundancy
D | 🔴 | Pricing tiers and uptime SLAs are familiar procurement concepts | Infrastructure decisions made on commercial rather than technical grounds

17. What Is Infrastructure as Code?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct and captures the key properties | Reproducible, reviewable, version controlled infrastructure
B | 🟡 | Diagram generation is a related practice | Confuses documentation tooling with infrastructure management
C | 🟡 | Documentation in a shared wiki sounds collaborative | Infrastructure decisions recorded but not enforced
D | 🔴 | Budget coding sounds like responsible governance | A finance process confused for an engineering practice

18. When Should Testing Happen?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Continuous automated testing is the correct answer | Fast feedback and high confidence with every change
B | 🟡 | Dedicated testing phases feel thorough | Late discovery of problems that compound quickly
C | 🔴 | Milestone sign off sounds like governance | Testing as a gate rather than a continuous signal
D | 🟡 | Pre release exploratory testing is real and valuable | Leaves too much surface area uncovered between releases

19. A Team Has 95% Code Coverage

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Coverage without behaviour validation is a known trap | Honest assessment of quality rather than metric satisfaction
B | 🔴 | 95% sounds high and therefore safe | False confidence in a metric that can be gamed
C | 🟡 | Module level breakdown adds nuance | Still treats coverage as the primary quality signal
D | 🔴 | Benchmarking sounds rigorous | Comparing against benchmarks of a flawed metric
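The coverage trap named in option A is easy to demonstrate. The example below is hypothetical (the function and its bug are invented): the test executes every line of `apply_discount`, so a coverage tool reports 100% for it, yet the test asserts nothing, so the pricing bug passes the suite.

```python
def apply_discount(price, percent):
    # Intended behaviour: price * (1 - percent / 100).
    # Bug: subtracts the raw percentage instead of a fraction of the price.
    return price - percent

def test_apply_discount():
    # Executes every line of apply_discount, so line coverage is 100%...
    # ...but there is no assertion, so the bug above is never detected.
    apply_discount(200, 10)

# A behavioural assertion would expose it immediately:
#   assert apply_discount(200, 10) == 180   # fails: the buggy code returns 190
```

This is why high coverage is a necessary-but-weak signal: it says the code ran, not that anyone checked what it did.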

20. What Is the Purpose of a Chaos Engineering Exercise?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Deliberate failure injection to test recovery is correct | Verified resilience rather than assumed resilience
B | 🟡 | Load testing is a related practice | Confuses performance testing with resilience testing
C | 🟡 | DR failover testing is real and important | Narrower than chaos engineering as a practice
D | 🔴 | Incident process stress testing sounds useful | Focuses on the organisation’s response rather than the system’s behaviour
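Option A can be sketched as a minimal, hypothetical failure-injection harness: a wrapper randomly raises errors on calls to a dependency, so the caller's fallback path is exercised on purpose rather than discovered during an incident. Every name here (`flaky`, `get_balance`, the failure rate) is illustrative, not a real tool.

```python
import random

def flaky(func, failure_rate=0.3, rng=random.Random(42)):
    """Wrap a dependency call, injecting failures at a controlled, seeded rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper

def get_balance(account):
    # Stands in for a real downstream service call.
    return 1000

def get_balance_with_fallback(account, source):
    try:
        return source(account)
    except ConnectionError:
        return None  # degraded but safe: the caller shows "balance unavailable"

# Chaos run: inject failures and verify the system degrades gracefully.
chaotic_source = flaky(get_balance)
results = [get_balance_with_fallback("acc-1", chaotic_source) for _ in range(100)]
assert all(r in (1000, None) for r in results)  # never crashes, only degrades
```

The design point matches the answer key: the exercise verifies the system's behaviour under failure, not the organisation's paperwork about failure.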

21. Data Warehouse Versus Data Lake

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Correct definition that captures the key architectural difference | Informed decisions about where data belongs
B | 🟡 | On premises versus cloud is a familiar axis | Conflates deployment model with data architecture
C | 🟡 | Team ownership is a real governance question | Reduces an architectural concept to an org chart question
D | 🔴 | Historical versus real time is a familiar framing | Fundamentally misunderstands both concepts

22. Building a Churn Prediction Model

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Data quality and bias are the foundation of any model | Models that work and can be trusted
B | 🟡 | Revenue impact is a legitimate prioritisation question | Skips past the foundational data question
C | 🟡 | Vendor platforms are a real option | Deploy fast, discover limits later
D | 🔴 | Demonstrating value to the executive committee is real pressure | AI theatre that looks impressive and produces wrong answers

23. The Biggest Risk of Unmonitored Production Models

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Data drift and silent degradation is the real risk | Monitoring practices that catch decay before it causes harm
B | 🟡 | Compute costs are a real operational concern | Misses the accuracy decay that is far more damaging
C | 🟡 | Governance review is a legitimate process | Compliance framing misses the operational risk
D | 🔴 | Executive confidence is a real concern | Optimises for perception rather than reliability
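The silent degradation in option A is detectable with even a crude check. The sketch below is a hypothetical illustration (invented numbers, invented `drift_score` helper): compare a live feature's distribution against its training baseline and alert when the mean shifts by several training standard deviations. Real drift monitoring uses richer statistics, but the shape is the same.

```python
import statistics

def drift_score(training_values, live_values):
    """Shift in mean between live and training data, scaled by training stdev."""
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    return abs(statistics.mean(live_values) - mu) / sigma

# Hypothetical feature: e.g. average transactions per customer per week.
training = [10, 12, 11, 13, 12, 11, 10, 12]
live_ok = [11, 12, 10, 13, 11]
live_drifted = [25, 27, 26, 28, 24]  # customer behaviour has changed

assert drift_score(training, live_ok) < 1.0       # within normal variation
assert drift_score(training, live_drifted) > 3.0  # should trigger an alert
```

Without a check like this running continuously, the model keeps emitting confident predictions about a world that no longer exists, which is exactly the failure mode the question describes.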

24. A Biased AI Model Is Proposed for Customer Decisions

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Takes bias seriously as a first order concern | Ethical deployment and regulatory protection
B | 🟡 | Human review sounds like a safeguard | Scales bias while providing legal cover
C | 🟡 | Steering committee decision sounds like governance | Delegates an ethical decision to a commercial forum
D | 🔴 | First mover advantage is a real competitive argument | Discrimination at scale with a future iteration that may never arrive

25. One AI Engineer Embedded in a Feature Team

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Recognises the structural failure clearly | Deliberate community of practice and proper quality gates
B | 🟡 | Clear deliverables and product ownership sound sufficient | Unreviewed AI work validated by people who cannot evaluate it
C | 🔴 | Embedded specialists sound efficient | AI capability that has no peers, no quality gate, and no future
D | 🟡 | Training and milestone measurement sound supportive | Isolates the engineer while providing the appearance of support

26. What Is Data Governance in Practice?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Treats data as a product with full lifecycle accountability | Trustworthy data that can be used with confidence
B | 🟡 | Policy and committee governance is a real structure | Bureaucratic access management masquerading as governance
C | 🟡 | Classification and retention policies are real requirements | Compliance artefacts without operational governance
D | 🔴 | Technology controls feel like governance | Enforces access without understanding what the data is or means

27. You Need to Hire a Senior Engineer

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Curiosity and simplification track record predict long term impact | Engineers who make systems better rather than just larger
B | 🟡 | Certifications feel like proof of knowledge | Credential matching rather than capability hiring
C | 🟡 | Communication with executives sounds valuable | Engineers selected for stakeholder management over technical depth
D | 🔴 | Delivery track record sounds like the right signal | Engineers selected by programme managers rather than engineers

28. An Engineer Proves You Wrong

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Being right matters more than being in charge | Trust, psychological safety, and better decisions
B | 🟡 | Formal documentation sounds thorough | Bureaucratic delay that signals pushback is unwelcome
C | 🟡 | Strategic context is a real consideration | Strategic context used to override technical evidence
D | 🔴 | Commercial considerations are real | Teaches engineers their input is decorative

29. The Biggest Risk of a Non Technical Leader

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Inability to distinguish risk from excuses is the core failure mode | Leaders who get fooled in both directions
B | 🟡 | Vendor over reliance is a real pattern | One manifestation of a deeper capability gap
C | 🟡 | Talent attrition is a real consequence | Symptom rather than root cause
D | 🟡 | Timeline focus over technical foundations is common | Another symptom of the same underlying problem

30. A Vendor Promises to Solve a Critical Problem

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Exit costs and vendor direction changes are the durable concerns | Relationships that preserve architectural independence
B | 🟡 | Procurement process is a real requirement | Approved vendor lists substituting for technical evaluation
C | 🟡 | Case studies are useful social proof | NPS and reference customers replacing structural analysis
D | 🔴 | Timeline alignment is always relevant | Vendor selected based on board commitments rather than fit

31. Clever Architecture Versus Simple Architecture

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Operability and maintainability outlast impressiveness | Systems that the next team can understand and fix at 02:00
B | 🟡 | Performance metrics are a real consideration | Complexity justified by benchmarks that matter at demo time
C | 🟡 | TCO analysis is legitimate | Analysis paralysis replacing a clear architectural principle
D | 🟡 | Architect recommendation makes sense | Defers to expertise but avoids the underlying principle

32. A 97 Slide Strategy Deck

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Length compensating for clarity is a real and common failure | Pressure for clear thinking over comprehensive coverage
B | 🔴 | Thoroughness sounds like due diligence | Rewarding volume over clarity
C | 🟡 | Executive summary sounds practical | May preserve the 97 slides rather than replacing them
D | 🟡 | Comprehensive review sounds responsible | 97 slides reviewed without asking whether they add up to a strategy

33. A High Performing Team Has No Status Report

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Outcomes are the evidence. Reports are not the product | Freedom for high performing teams to focus on results
B | 🔴 | Governance visibility sounds like a legitimate requirement | Reporting as a proxy for leadership confidence
C | 🟡 | Lightweight alignment sounds reasonable | Process for its own sake introduced into a team that does not need it
D | 🔴 | Accountability and audit discipline sounds professional | Bureaucratic expectations imposed on a team that is already delivering

34. A Team Adjusts and Delivers Two Weeks Late

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Adaptation during complex work is exactly correct behaviour | A culture that engages honestly with what they discover
B | 🔴 | Planning failure is a clean and familiar frame | Teams that fabricate certainty rather than discovering truth
C | 🟡 | Lessons learned sounds constructive | Document production as a substitute for genuine understanding
D | 🔴 | Risk management logging sounds rigorous | More assumption validation that produces more fabricated certainty

35. A Lead Says They Do Not Know Yet

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Not knowing is valid. What matters is what reduces the unknowns | Honest engineering cultures that surface uncertainty early
B | 🟡 | Having something to report upward sounds responsible | Risk registers produced to satisfy upward reporting rather than to manage risk
C | 🟡 | Probability weightings sound rigorous | Manufactured precision on genuinely uncertain situations
D | 🔴 | Escalation sounds like accountability | Penalising honesty and teaching people to fake confidence

36. What Is the Most Important Thing to Measure?

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Business outcomes, reliability, and safe change are what technology actually exists to produce | Measurement that connects engineering work to things that matter
B | 🟡 | Velocity is a familiar agile metric | Story point farming that looks productive and may not be
C | 🟡 | Time to market is a real business concern | Optimises for speed over quality and sustainability
D | 🔴 | Budget adherence sounds like financial discipline | Measuring spend rather than value

37. A Senior Architect Disagrees Publicly and Bluntly

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Blunt disagreement backed by evidence is a sign of health | Better decisions and a culture where truth surfaces
B | 🟡 | Proper channels sound professional | Teaching people that public disagreement is insubordination
C | 🟡 | Commitment after a decision is a real norm | Commitment used to prevent legitimate reconsideration
D | 🔴 | Business timelines as the final frame sounds balanced | Technical expertise subordinated to schedule compliance

38. The Role of Engineers in Decision Making

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Active extraction and synthesis of engineering knowledge is how the best decisions get made | Products built from collective intelligence rather than individual instruction
B | 🟡 | Business leaders owning commercial outcomes sounds right | Technical input as decoration on pre made decisions
C | 🟡 | Execution excellence and implementation autonomy sound respectful | Engineers who are good at what they are told but disconnected from why
D | 🔴 | Product and business teams driving strategy sounds efficient | Strategy uninformed by the technical reality that will determine whether it is achievable

39. Your Best Engineers Have Gone Quiet

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Silence from strong engineers is almost always a warning | Early intervention before the best people leave in spirit or in practice
B | 🟡 | Focus and preference for code over meetings is real | Convenient reframe that avoids the harder question
C | 🟡 | Team maturity and alignment sound positive | Alignment that is actually submission
D | 🔴 | Fewer objections sounds like improved governance | A team that has learned not to disagree with leaders who do not want to hear it

40. An Engineer Says the Deadline Is Unrealistic

Option | Score | Why it is attractive | What it tends to create
A | 🟢 | Engineers who raise deadline alarms are usually right | Credible timelines and teams that are not burned out shipping things that break
B | 🟡 | Alternative timeline with breakdown sounds constructive | Amber because the warning should be taken seriously before asking for proof
C | 🔴 | Commercial commitments sound binding | Teams that silently absorb impossible constraints and deliver broken software
D | 🟡 | Quantified risk sounds rigorous | Can become a bar set high enough that legitimate warnings are never escalated

Interpretation

Mostly 🟢 means you approach technology leadership with the right instincts. You understand that engineering knowledge is a strategic resource, that quality and sustainability outlast delivery theatre, and that your role is to create conditions in which strong engineers can do their best work.

Mostly 🟡 means your instincts are not dangerous but they are shallow. You rely on process, optics, and familiar governance structures because they feel responsible. Under pressure, those defaults will pull you toward comfort rather than clarity. Watch for which categories your 🟡 answers cluster in because that is where your blind spots live.

Mostly 🔴 means you optimise for timelines, reporting, and the appearance of control. You likely see opinionated engineers as a management problem rather than an intellectual resource. The technology organisations you lead will deliver on time to specifications that were wrong, retain compliant engineers who stopped caring, and struggle to understand why customers leave.

The most damaging technology leaders are not the ones who know nothing. They are the ones who know enough to sound credible while making decisions that slowly hollow out the organisations they run.


Inspired by Why Andrew Baker Is the World’s Worst CTO

Business Heads: Technology Leadership Competence Assessment

A Self-Assessment for Technology Leaders

This questionnaire explores how you think about technology leadership, systems, teams, and delivery. There are no right or wrong answers. Each question presents four options that reflect different leadership styles and priorities. Simply select the option that best reflects your natural instinct in each situation.

Select one answer per question. Do not overthink it. Your first instinct is what matters.

1 Leadership Philosophy

Question 1. A major platform decision was approved by the steering committee six months ago. New evidence suggests it may be the wrong choice. What do you do?

A) Revisit the decision with the new evidence and recommend a course correction even if it causes short-term disruption

B) Flag the concern but continue execution since the committee already approved it and reversing would delay the programme

C) Raise it informally but keep delivery on track since the timeline commitments to the board cannot slip

D) Continue as planned because reopening approved decisions undermines confidence in the governance process

Question 2. Your team proposes simplifying a system by removing an integration layer. It will reduce complexity but invalidate three months of another team’s work. How do you proceed?

A) Protect the other team’s work and find a compromise that keeps both approaches since we need to respect the investment already made

B) Evaluate the simplification on its technical merits regardless of sunk cost and proceed if the outcome is better for customers

C) Delay the decision until next quarter’s planning cycle so it can be properly socialised across all stakeholders

D) Proceed only if the simplification can be shown to accelerate the current delivery timeline

Question 3. You inherit a technology organisation with seven management layers between the CTO and the engineers writing code. What is your first instinct?

A) Understand why each layer exists and remove any that do not directly contribute to decision quality or delivery outcomes

B) Add a dedicated delivery management function to coordinate across the layers more effectively

C) Maintain the structure but introduce better reporting dashboards so you can see through the layers

D) Restructure the layers around revenue streams so each layer has clear commercial accountability

Question 4. What is the primary purpose of a technology strategy document?

A) To secure budget approval by demonstrating alignment between technology investments and projected revenue growth

B) To reduce uncertainty by clarifying what the organisation will and will not build, and why

C) To provide a roadmap with delivery dates that the business can hold the technology team accountable to

D) To communicate the technology vision to non-technical stakeholders in a way they find compelling

2 Architecture and Systems Thinking

Question 5. What does the term blast radius mean in the context of systems architecture?

A) The scope of impact when a single component fails, and how far the failure propagates across dependent systems

B) The amount of data lost during a disaster recovery event before backups can be restored

C) The total number of customers affected during a planned maintenance window

D) The financial exposure created by a system outage, measured in lost revenue per minute
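
The framing in the correct answer can be made concrete: given a service dependency graph, the blast radius of a component is every service that depends on it, directly or transitively. A minimal sketch, with a wholly invented topology and service names:

```python
# Map each service to the services it depends on (hypothetical topology).
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": ["auth"],
    "reporting": ["inventory"],
    "auth": [],
}

def blast_radius(failed: str) -> set:
    """Return every service impacted, directly or transitively,
    when `failed` goes down."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for svc, deps in DEPENDS_ON.items():
            if svc in impacted:
                continue
            # Impacted if it depends on the failed service or on
            # anything already known to be impacted.
            if failed in deps or impacted & set(deps):
                impacted.add(svc)
                changed = True
    return impacted

print(sorted(blast_radius("auth")))
# An "auth" failure reaches everything above it; a leaf like
# "reporting" has an empty blast radius.
```

The architectural point is that the answer to "what fails when this fails" is a property of the topology, which is why it is a design-time concern rather than an incident-time discovery.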

Question 6. When designing a critical system, which of the following should be your primary architectural concern?

A) Ensuring the system can scale to meet projected revenue targets for the next three years

B) Designing for graceful failure so the system degrades safely rather than failing catastrophically

C) Selecting the vendor with the strongest enterprise support agreement and SLA guarantees

D) Ensuring the architecture aligns with the approved enterprise reference model and standards
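
The principle behind graceful failure can be sketched as a fallback wrapper: when a downstream dependency fails, the system degrades to a safe default instead of propagating the error to the customer. The service functions here are hypothetical stand-ins:

```python
def with_fallback(primary, fallback):
    """Call `primary`; on any failure, degrade to `fallback`
    rather than letting the error cascade to the caller."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            return fallback(*args, **kwargs)
    return wrapped

def live_recommendations(user_id):
    # Stand-in for a real downstream call that is currently failing.
    raise TimeoutError("recommendation service is down")

def cached_popular_items(user_id):
    # Degraded mode: generic results instead of a hard failure.
    return ["top-seller-1", "top-seller-2"]

get_recommendations = with_fallback(live_recommendations, cached_popular_items)
print(get_recommendations("u123"))  # → ['top-seller-1', 'top-seller-2']
```

The customer sees a slightly worse page rather than an error page, which is the practical meaning of degrading safely.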

Question 7. What does it mean to design a system assuming breach will happen?

A) Building layered defences, monitoring, and containment so that when a breach occurs the damage is limited and detected quickly

B) Purchasing comprehensive cyber insurance to cover the financial impact of a breach event

C) Conducting annual penetration tests and remediating all critical findings before the next audit cycle

D) Ensuring all systems are compliant with the relevant regulatory frameworks and industry standards

3 Delivery and Process

Question 8. A project is behind schedule. The team suggests reducing scope to meet the deadline. The business stakeholder wants the full scope delivered on time. What do you recommend?

A) Deliver the reduced scope with high quality and iterate, since shipping broken software on time is worse than shipping less software that works

B) Add additional resources to accelerate delivery since the business committed to the date with external partners

C) Negotiate a two-week extension with the full scope since the revenue impact of a delayed launch is manageable

D) Split the team to deliver the core features on time and the remaining features two weeks later as a fast follow

Question 9. How should work ideally flow through a well-functioning technology team?

A) Through two week sprints with defined ceremonies, backlog grooming, sprint reviews, and retrospectives

B) Through continuous small changes deployed frequently with clear ownership and minimal handoffs

C) Through quarterly planning cycles with monthly milestone reviews and weekly status reporting

D) Through a prioritised backlog managed by a product owner who coordinates with the business on delivery sequencing

Question 10. A team is consistently delivering features on time but production incidents are increasing. What does this tell you?

A) The team is likely cutting corners on quality to meet deadlines and the delivery metric is masking a growing technical debt problem

B) The team needs better production support tooling and a dedicated site reliability function

C) The team is delivering well but the infrastructure team is not scaling the platform to match the increased feature throughput

D) The incident management process needs improvement since faster triage would reduce the apparent incident volume

4 Technical Fundamentals

Question 11. What is the difference between vertical scaling and horizontal scaling?

A) Vertical scaling adds more power to a single machine while horizontal scaling adds more machines to distribute the load

B) Vertical scaling increases storage capacity while horizontal scaling increases network bandwidth

C) Vertical scaling is for databases and horizontal scaling is for application servers

D) Vertical scaling is cheaper at small volumes while horizontal scaling is cheaper at large volumes, which is why you choose based on cost projections
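
The distinction in the correct answer can be shown in miniature. A toy round-robin balancer stands in for horizontal scaling: capacity grows by adding instances to the pool, whereas vertical scaling would mean making one instance bigger. Instance names are illustrative:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distribute requests across a pool of instances.
    Horizontal scaling = adding instances to this pool;
    vertical scaling = replacing one instance with a larger machine."""
    def __init__(self, instances):
        self.instances = list(instances)
        self._ring = cycle(self.instances)

    def route(self, request_id):
        # Assign the next instance in rotation to this request.
        return (request_id, next(self._ring))

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = [lb.route(i) for i in range(6)]
# Each of the three instances receives two of the six requests.
```

Adding "app-4" to the pool raises total capacity without touching any existing machine, which is why horizontal scaling also tends to improve fault tolerance.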

Question 12. What is technical debt?

A) Shortcuts or suboptimal decisions in code and architecture that make future changes harder, slower, or riskier

B) The accumulated cost of software licences and infrastructure that the organisation is contractually committed to paying

C) The gap between the current technology stack and the approved target state architecture

D) Legacy systems that have not yet been migrated to the cloud as part of the digital transformation programme

Question 13. Why is it important that a system can be observed in production?

A) Because without visibility into how the system behaves under real conditions you cannot diagnose problems, understand performance, or detect failures early

B) Because the compliance team requires evidence that systems are being monitored as part of the annual audit

C) Because the business needs real-time dashboards showing transaction volumes and revenue metrics

D) Because the vendor SLA requires the organisation to demonstrate monitoring capability to qualify for support credits
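
The correct answer in practice usually means emitting structured events with timing and outcome for every operation, so production behaviour can be queried after the fact. A minimal standard-library sketch; the operation and logger names are invented:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("payments")

def observed(operation):
    """Decorator: log a structured JSON event with duration and
    outcome for every call, so latency and failures are visible."""
    def decorate(fn):
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            status = "error"
            try:
                result = fn(*args, **kwargs)
                status = "ok"
                return result
            finally:
                log.info(json.dumps({
                    "op": operation,
                    "status": status,
                    "duration_ms": round((time.perf_counter() - start) * 1000, 2),
                }))
        return wrapped
    return decorate

@observed("charge_card")
def charge_card(amount):
    return {"charged": amount}

charge_card(100)
```

Because the events are structured rather than free text, they can be aggregated into the early-warning signals the question describes, not just read by a human after an outage.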

5 Cloud Computing

Question 14. What is the primary benefit of using a public cloud provider like AWS or Azure?

A) The ability to provision and scale infrastructure on demand without managing physical hardware, paying only for what you use

B) Guaranteed lower costs compared to on-premises infrastructure for all workload types and volumes

C) Automatic compliance with all regulatory requirements since the cloud provider manages the security controls

D) Eliminating the need for a technology team since the cloud provider manages everything end to end

Question 15. What is the shared responsibility model in cloud computing?

A) The cloud provider is responsible for the security of the cloud infrastructure while the customer is responsible for securing what they build and run on it

B) The cloud provider and the customer share the cost of infrastructure equally based on a negotiated commercial agreement

C) Both the cloud provider and the customer have equal responsibility for all aspects of security and neither can delegate

D) The cloud provider assumes full responsibility for everything deployed on their platform as part of the service agreement

Question 16. What is an availability zone in the context of cloud infrastructure?

A) A physically separate data centre within a cloud region, designed so that failures in one zone do not affect others

B) A geographic region where the cloud provider offers services, such as Europe West or US East

C) A virtual network boundary that isolates different customer workloads from each other for security purposes

D) A pricing tier that determines the level of uptime guarantee and support response time for your workloads

Question 17. What is Infrastructure as Code?

A) Defining and managing cloud infrastructure through machine-readable configuration files that can be version controlled and reviewed like software

B) A software tool that automatically generates infrastructure diagrams from the live cloud environment

C) A methodology for documenting infrastructure decisions in a shared wiki so the team can track changes over time

D) An approach where infrastructure costs are coded into the project budget as a separate line item from application development
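
The essence of the correct answer, declarative desired state plus an idempotent apply step, can be sketched without any real cloud provider. The resource model below is entirely hypothetical:

```python
# Desired state, as it would live in version control (hypothetical resources).
DESIRED = {
    "web-server": {"type": "vm", "size": "small"},
    "app-db": {"type": "database", "size": "medium"},
}

def reconcile(desired, actual):
    """Compute the changes needed to move `actual` to `desired`.
    Running it against an already-converged state yields no changes."""
    plan = []
    for name, spec in desired.items():
        if name not in actual:
            plan.append(("create", name))
        elif actual[name] != spec:
            plan.append(("update", name))
    for name in actual:
        if name not in desired:
            plan.append(("delete", name))
    return plan

def apply(desired, actual):
    """Execute the plan against the (simulated) environment."""
    for action, name in reconcile(desired, actual):
        if action == "delete":
            actual.pop(name)
        else:
            actual[name] = dict(desired[name])
    return actual

state = apply(DESIRED, {})               # first run creates everything
assert reconcile(DESIRED, state) == []   # second run is a no-op
```

The value is exactly the value of software: the desired state can be reviewed in a pull request, diffed against production, and rolled back, which ad hoc console changes cannot.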

6 Testing Strategy

Question 18. When should testing happen in the development lifecycle?

A) Continuously throughout development, with automated tests running on every code change as part of the build pipeline

B) After development is complete, during a dedicated testing phase before the release is approved for production

C) At key milestones defined in the project plan, with formal sign-off required before moving to the next phase

D) Primarily before major releases, with exploratory testing conducted by the QA team in the staging environment

Question 19. A team tells you they have 95% code coverage. How confident should you be in their quality?

A) Coverage alone does not indicate quality because tests can cover code without meaningfully validating behaviour or edge cases

B) Very confident since 95% coverage means almost all of the codebase has been validated by automated tests

C) Moderately confident but you would want to see the coverage broken down by module to check for gaps in critical areas

D) You would need to compare the coverage metric against the industry benchmark for their technology stack to assess it properly

Question 20. What is the purpose of a chaos engineering or game day exercise?

A) To deliberately introduce failures into a system to test how it responds and to build confidence that recovery mechanisms work

B) To simulate peak traffic scenarios to verify the infrastructure can handle projected load during high revenue periods

C) To test the disaster recovery plan by failing over to the secondary site and measuring recovery time against the SLA

D) To stress test the team’s incident management process and identify bottlenecks in the escalation procedures
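
The correct answer in miniature: inject failures deliberately, then verify that the recovery path absorbs them. Everything here is a toy; the failure rate, retry policy, and service are arbitrary illustrations:

```python
import random

def flaky(fn, failure_rate, rng):
    """Chaos wrapper: make `fn` fail with probability `failure_rate`,
    simulating an unreliable dependency during a game day exercise."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return fn(*args, **kwargs)
    return wrapped

def fetch_balance(account):
    return 100  # stand-in for a call to a real downstream service

def fetch_with_retry(fn, attempts=3):
    """The recovery mechanism under test: retry on connection errors."""
    for i in range(attempts):
        try:
            return fn("acc-1")
        except ConnectionError:
            if i == attempts - 1:
                raise

rng = random.Random(42)  # seeded so the game day is reproducible
chaotic_fetch = flaky(fetch_balance, failure_rate=0.5, rng=rng)

# Run traffic through the retry path and measure how many requests
# survive the injected failures.
successes = 0
for _ in range(100):
    try:
        fetch_with_retry(chaotic_fetch)
        successes += 1
    except ConnectionError:
        pass
print(f"{successes}/100 requests survived injected failures")
```

The point of the exercise is the confidence gained: if retries, fallbacks, and alerts work under injected failure, there is evidence they will work under real failure.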

7 Data and AI

Question 21. What is the difference between a data warehouse and a data lake?

A) A data warehouse stores structured, curated data optimised for querying and reporting, while a data lake stores raw data in its native format for flexible future use

B) A data warehouse is an on-premises solution while a data lake is a cloud-native service that replaces the need for traditional databases

C) A data warehouse is owned by the business intelligence team while a data lake is owned by the engineering team, which is why they are governed separately

D) A data warehouse handles historical data for compliance purposes while a data lake handles real-time data for operational dashboards

Question 22. Your organisation wants to build a machine learning model to predict customer churn. What is the first question you should ask?

A) Do we have clean, representative data that captures the behaviours and signals that precede churn, and do we understand the biases in that data

B) What is the expected revenue impact of reducing churn by a target percentage, and does it justify the investment in a data science team

C) Which vendor platform offers the best prebuilt churn prediction model so we can deploy quickly without building a team from scratch

D) Can we have a working model within the current quarter so we can demonstrate the value of AI to the executive committee

Question 23. What is the biggest risk of deploying a machine learning model into production without ongoing monitoring?

A) The model will silently degrade as real world data drifts away from the data it was trained on, producing increasingly wrong predictions that nobody notices until damage is done

B) The model will consume increasing amounts of compute resources over time, driving up infrastructure costs beyond the original budget

C) The compliance team may flag the model as a risk because it was deployed without a formal model governance review and sign-off process

D) The business will lose confidence in AI if the model produces a visible error, which could jeopardise funding for future AI initiatives
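
The silent degradation described in the correct answer is what drift monitoring exists to catch. A minimal sketch compares the live feature distribution against training statistics; the threshold and data are illustrative, and production systems use richer per-feature tests:

```python
from statistics import mean, stdev

def drift_alert(training_values, live_values, z_threshold=3.0):
    """Flag drift when the live mean shifts more than `z_threshold`
    standard errors from the training mean. Illustrative only;
    real monitoring uses tests like PSI or Kolmogorov-Smirnov."""
    mu, sigma = mean(training_values), stdev(training_values)
    standard_error = sigma / (len(live_values) ** 0.5)
    z = abs(mean(live_values) - mu) / standard_error
    return z > z_threshold

# Hypothetical feature values at training time and in production.
training = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]
stable_live = [10.1, 9.9, 10.3, 10.0, 10.2, 9.7, 10.4, 10.1]
drifted_live = [14.8, 15.2, 15.0, 14.9, 15.1, 15.3, 14.7, 15.0]

assert not drift_alert(training, stable_live)   # no alert
assert drift_alert(training, drifted_live)      # drift detected
```

Without a check like this running continuously, the model keeps emitting confident predictions on data it has effectively never seen, which is exactly the failure mode the question describes.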

Question 24. A business stakeholder asks you to build an AI feature that automates a customer decision. The team warns that the training data contains historical bias. What do you do?

A) Take the bias concern seriously. Deploying a biased model at scale will amplify discrimination, create regulatory exposure, and damage customer trust in ways that are extremely difficult to undo

B) Proceed with the deployment but add a disclaimer that the model’s recommendations should be reviewed by a human before any final decision is made

C) Ask the data science team to quantify the bias impact and present a risk assessment to the steering committee so leadership can make an informed commercial decision

D) Deprioritise the concern for now and launch the feature since the competitive advantage of being first to market outweighs the risk, and the bias can be addressed in a future iteration

Question 25. You have hired one AI engineer and placed them alone in a feature team surrounded by backend and frontend developers. Nobody in the team or its management chain has AI or machine learning experience. The engineer’s work is reviewed by people who do not understand it. How do you evaluate this structure?

A) This is a problem. The engineer has no peers to learn from, no manager who can grow their career, and no quality gate on their work. They will either stagnate, produce unchallenged work of unknown quality, or leave. AI engineers need to sit in or be connected to a community of practice with people who understand their discipline

B) This is fine as long as the engineer has clear deliverables and the feature team has a strong product owner who can validate the business outcomes of the AI work

C) This is efficient. Embedding specialists directly in feature teams ensures their work is aligned with delivery priorities and avoids the overhead of a separate AI team that operates disconnected from the product

D) This is manageable. Provide the engineer with access to external training and conferences so they can maintain their skills, and ensure their performance is measured on delivery milestones like any other team member

Question 26. What does data governance mean in practice?

A) Ensuring the organisation knows what data it has, where it lives, who owns it, how it flows, what quality it is in, and what rules govern its use, so that data is treated as a product rather than an accident

B) A framework of policies and committees that approve data access requests and ensure all data usage complies with the relevant regulatory requirements

C) A set of data classification standards and retention policies that are documented and audited annually to satisfy regulatory obligations

D) A technology platform that enforces role-based access controls and encrypts data at rest and in transit across all systems

8 People and Hiring

Question 27. You need to hire a senior engineer. Which quality matters most?

A) Deep curiosity, the ability to reason through unfamiliar problems, and a track record of simplifying complex systems

B) Certifications in the specific technologies your team currently uses, with at least ten years of experience in the industry

C) Strong communication skills and experience presenting to executive stakeholders and steering committees

D) A proven ability to deliver projects on time and within budget, with references from previous programme managers

Question 28. An engineer pushes back on a technical decision you have made, providing evidence you were wrong. What is the ideal response?

A) Thank them, evaluate the evidence, and change the decision if the evidence warrants it because being right matters more than being in charge

B) Acknowledge their input and ask them to document their concerns formally so they can be reviewed in the next architecture review board

C) Listen carefully but explain the broader strategic context they may not be aware of that influenced your original decision

D) Appreciate the initiative but remind them that decisions at your level factor in commercial and timeline considerations beyond the technical merits

Question 29. What is the biggest risk when a non-technical leader runs a technology team?

A) They cannot distinguish between genuine technical risk and comfortable excuses, which leads to either missed danger or wasted time

B) They tend to over-rely on vendor solutions and consultancies because they cannot evaluate build versus buy decisions independently

C) They struggle to earn the respect of senior engineers, which leads to talent attrition and difficulty recruiting strong replacements

D) They focus on timelines and deliverables rather than the technical foundations that determine whether those deliverables are sustainable

9 Quality and Sustainability

Question 30. A vendor promises to solve a critical problem with their platform. What is your first concern?

A) Whether the solution creates a dependency that will be expensive or impossible to exit, and what happens when the vendor changes direction

B) Whether the vendor is on the approved procurement list and whether the commercial terms fit within the current budget cycle

C) Whether the vendor has case studies from similar organisations and what their Net Promoter Score is among existing customers

D) Whether the vendor can commit to a delivery timeline that aligns with the programme milestones already communicated to the board

Question 31. You are reviewing two architecture proposals. Proposal A is clever and impressive but requires deep expertise to operate. Proposal B is simpler but less elegant. Which do you prefer?

A) Proposal B, because a system that can be understood, operated, and maintained by the team that inherits it is more valuable than one that impresses today

B) Proposal A, because the additional complexity is justified if it delivers significantly better performance metrics

C) Neither until both proposals include detailed cost projections and a total cost of ownership comparison over five years

D) Whichever proposal the lead architect recommends since they have the deepest technical context on the constraints

Question 32. A 97-slide strategy deck is presented to you. What is your reaction?

A) Scepticism, because length often compensates for lack of clarity and a strong strategy should be explainable in a few pages

B) Appreciation, because a thorough strategy deck shows the team has done their due diligence and considered all angles

C) Request an executive summary of no more than five slides that highlights the key investment asks and expected returns

D) Review it in detail because strategic decisions of this magnitude deserve comprehensive analysis and supporting evidence

10 Reporting and Planning

Question 33. A technology team has no weekly status report. They deploy daily, incidents are low, and customers are satisfied. Is this a problem?

A) No. Outcomes are the evidence. If the system works, customers are happy, and the team ships reliably, the absence of a status report means nothing is being hidden

B) Yes. Without a structured weekly report the leadership team has no visibility into what the team is doing and cannot govern effectively

C) It depends. A lightweight status update would be beneficial for alignment even if things are going well, since stakeholders deserve visibility

D) Yes. Consistent reporting is a professional discipline. Even high-performing teams need to document their progress for accountability and audit purposes

Question 34. A team starts a complex migration and discovers halfway through that the original plan was based on incorrect assumptions. They adjust and complete the migration successfully but two weeks later than planned. How do you evaluate this?

A) Positively. Learning while doing is an inherent property of complex work. The team adapted to reality and delivered a successful outcome, which is exactly what good engineering looks like

B) As a planning failure. The incorrect assumptions should have been identified during the planning phase. A proper discovery exercise would have prevented the overrun

C) Neutrally. The outcome was acceptable but the team should produce a lessons learned document to prevent similar planning gaps in future projects

D) As a risk management issue. The two-week overrun needs to be logged and the planning process needs to include more rigorous assumption validation before execution begins

Question 35. You ask a technology lead how a project is going. They say they do not know yet because the team is still working through some unknowns. How do you respond?

A) Appreciate the honesty. Not knowing is a valid state early in complex work. Ask what they are doing to reduce the unknowns and when they expect to have a clearer picture

B) Ask them to prepare a risk register and preliminary timeline estimate within two days so you have something to report upward

C) Express concern. A technology lead should always be able to articulate the status of their work, even if uncertain, and should present options with probability weightings

D) Escalate the concern. If the lead cannot provide a clear status update, the project may lack adequate governance and oversight

Question 36. What is the most important thing to measure about a technology team’s performance?

A) The business outcomes their work enables, including reliability, customer experience, and the ability to change safely

B) Velocity and throughput, measured by story points completed per sprint across all teams

C) Time to market for new features, measured from business request to production deployment

D) Budget adherence, measured by comparing actual technology spend against the approved annual plan

11 Relationship with Technologists

Question 37. A senior architect strongly disagrees with your proposed approach and presents an alternative in a team meeting. They are blunt and direct. How do you handle this?

A) Welcome it. Blunt disagreement backed by evidence is a sign of a healthy team. Evaluate the alternative on its merits and decide based on what produces the best outcome

B) Thank them for their perspective but ask them to raise concerns through the proper channels rather than challenging your direction in a group setting

C) Acknowledge their passion but remind the team that once a direction is set, the expectation is to commit and execute rather than relitigate decisions

D) Listen but note that architectural decisions need to factor in business timelines and stakeholder commitments, not just technical preferences

Question 38. How do you view the role of engineers in the decision making process?

A) Engineers are domain experts whose knowledge should be actively extracted, challenged, and synthesised into better decisions. The best outcomes come from iterative collaboration, not instruction

B) Engineers should provide technical input and recommendations, but the final decision authority rests with the business leader who owns the commercial outcome

C) Engineers should focus on execution excellence. They are most effective when given clear requirements and the autonomy to choose the implementation approach

D) Engineers should be consulted on technical feasibility, but strategic decisions about what to build and when should be driven by the product and business teams

Question 39. You notice your best engineers have stopped voicing opinions in meetings. What does this tell you?

A) Something is wrong. When strong engineers go quiet, it usually means they have concluded that their input does not matter, which means the organisation is about to lose them or already has in spirit

B) They may be focused on delivery. Not every engineer wants to participate in strategic discussions and some prefer to let their code speak for itself

C) It could indicate that the team has matured and aligned around a shared direction, which reduces the need for debate

D) It suggests the decision making process is working efficiently. Fewer objections means the planning and communication have improved

Question 40. An engineer tells you the proposed deadline is unrealistic and the team will either miss it or ship something that breaks. What do you do?

A) Take the warning seriously. Engineers who raise alarms about deadlines are usually right and ignoring them is how organisations end up with production failures and burnt out teams

B) Acknowledge the concern and ask them to propose an alternative timeline with a clear breakdown of what can be delivered by when

C) Thank them for the flag but explain that the deadline was set based on commercial commitments and the team needs to find a way to make it work

D) Ask them to quantify the risk. If they can show specific technical evidence for why the deadline is unrealistic, you will escalate it. Otherwise the plan stands


Assessor Guide

Everything below this line is for the assessor only. Do not share with the candidate.

Traffic Light Scoring

Each answer is scored using a traffic light system.

Green. Strong technology leadership instinct. The answer demonstrates understanding of systems thinking, quality, sustainability, customer outcomes, or respect for engineering as a discipline.

Amber. Acceptable but surface level. The answer is not wrong but reveals a preference for process, optics, conventional wisdom, or a management lens over a technology leadership lens.

Red. Concerning. The answer reveals a fixation on timelines, revenue projections, reporting, governance ceremony, or a belief that technologists are interchangeable resources who should execute rather than think.

Answer Key

1 Leadership. Green: A. Amber: B, C. Red: D.
2 Leadership. Green: B. Amber: A, C. Red: D.
3 Leadership. Green: A. Amber: B, C. Red: D.
4 Leadership. Green: B. Amber: C, D. Red: A.
5 Architecture. Green: A. Amber: B, C. Red: D.
6 Architecture. Green: B. Amber: C, D. Red: A.
7 Architecture. Green: A. Amber: B, C. Red: D.
8 Delivery. Green: A. Amber: C, D. Red: B.
9 Delivery. Green: B. Amber: A, D. Red: C.
10 Delivery. Green: A. Amber: B, C. Red: D.
11 Technical. Green: A. Amber: B, C. Red: D.
12 Technical. Green: A. Amber: B, C. Red: D.
13 Technical. Green: A. Amber: B, D. Red: C.
14 Cloud. Green: A. Amber: B, C. Red: D.
15 Cloud. Green: A. Amber: B, C. Red: D.
16 Cloud. Green: A. Amber: B, C. Red: D.
17 Cloud. Green: A. Amber: B, C. Red: D.
18 Testing. Green: A. Amber: B, D. Red: C.
19 Testing. Green: A. Amber: B, C. Red: D.
20 Testing. Green: A. Amber: C, D. Red: B.
21 Data and AI. Green: A. Amber: B, C. Red: D.
22 Data and AI. Green: A. Amber: B, C. Red: D.
23 Data and AI. Green: A. Amber: B, C. Red: D.
24 Data and AI. Green: A. Amber: B, C. Red: D.
25 Data and AI. Green: A. Amber: B, D. Red: C.
26 Data and AI. Green: A. Amber: B, C. Red: D.
27 People. Green: A. Amber: B, C. Red: D.
28 People. Green: A. Amber: B, C. Red: D.
29 People. Green: A. Amber: B, C. Red: D.
30 Quality. Green: A. Amber: B, C. Red: D.
31 Quality. Green: A. Amber: B, D. Red: C.
32 Quality. Green: A. Amber: B, C. Red: D.
33 Reporting. Green: A. Amber: C, D. Red: B.
34 Reporting. Green: A. Amber: C, D. Red: B.
35 Reporting. Green: A. Amber: B, C. Red: D.
36 Reporting. Green: A. Amber: B, C. Red: D.
37 Technologists. Green: A. Amber: B, C. Red: D.
38 Technologists. Green: A. Amber: B, D. Red: C.
39 Technologists. Green: A. Amber: B, C. Red: D.
40 Technologists. Green: A. Amber: B, D. Red: C.

Scoring Thresholds

30 to 40 Green. Strong candidate. Likely to build sustainable technology, retain talented engineers, and make sound architectural decisions.

20 to 29 Green. Moderate. May need coaching on the difference between managing a technology team and leading one. Watch for patterns in which categories the red answers cluster.

Below 20 Green. Significant risk. Likely to prioritise optics and timelines over quality, struggle to retain senior technologists, and make hiring decisions based on compliance rather than capability.

10 or more Red. Disqualifying regardless of green count. The candidate consistently gravitates toward answers that would damage engineering culture, product quality, and team retention.

Red Flag Patterns

Beyond the raw count, watch for clustering patterns that reveal specific blind spots.

The Timeline Addict. Red answers cluster in Delivery and Quality. The candidate treats every question as a scheduling problem and evaluates every decision through the lens of “will this delay the programme.”

The Dashboard Governor. Red answers cluster in Reporting and Planning. The candidate believes that better reporting equals better understanding, and that learning while doing is evidence of poor planning rather than an inherent property of complex work.

The Order Taker Factory. Red answers cluster in Relationship with Technologists. The candidate sees engineers as execution resources, gets uncomfortable with opinionated technologists, and interprets pushback as insubordination rather than intellectual rigour.

The Revenue Lens. Red answers cluster across multiple categories but consistently reference commercial outcomes, revenue projections, or stakeholder commitments as the deciding factor. Technology decisions are subordinated to the current quarter’s numbers.

The Process Worshipper. Red answers cluster in Delivery and Leadership. The candidate equates process with progress, ceremonies with delivery, and governance with good judgment.

The AI Tourist. Red answers cluster in Data and AI. The candidate treats AI as a buzzword to be deployed for competitive optics rather than a discipline that requires data quality, monitoring, ethical consideration, and properly supported specialists. They see nothing wrong with isolating a single AI engineer in a team that cannot grow, challenge, or manage them.

A Note on Opinionated Technologists

One of the most revealing dimensions of this assessment is how the candidate responds to questions about engineers who push back, disagree, or hold strong technical opinions. Business heads who have succeeded in environments where teams execute instructions often find opinionated technologists threatening. They interpret technical pushback as resistance, disagreement as disloyalty, and independent thinking as a management problem.

The reality is the opposite. The best technology teams are built from opinionated people who care deeply about the work. The role of the leader is not to suppress those opinions but to create an environment where they can be heard, challenged, and synthesised into better decisions. A leader who cannot tolerate dissent will build a team of compliant executors who ship mediocre products on time and wonder why the customers leave.

A Note on Learning While Doing

Business heads with a strong planning orientation often view learning while doing as evidence of failure. If you had planned properly, you would not need to learn anything during execution. This belief is incompatible with technology leadership.

Complex systems cannot be fully understood before they are built. Architecture emerges from contact with reality. Requirements change as users interact with early versions. Performance characteristics only reveal themselves under production load. Security vulnerabilities surface through adversarial testing, not through documentation reviews.

A leader who demands complete certainty before starting will either never start or will force the team to fabricate certainty they do not have, which is worse. The right instinct is to plan enough to reduce the biggest risks, start building, learn from what you discover, and adjust. This is not the absence of planning. It is the only kind of planning that works for complex technology.

A Note on Engineers as Order Takers

The most damaging instinct a business head can carry into a technology organisation is the belief that engineers exist to execute instructions. This mental model treats technology as a cost centre staffed by interchangeable resources whose job is to convert requirements into code on schedule.

In practice, the best engineers carry deep domain knowledge, architectural intuition, and an understanding of how systems behave under stress that cannot be replicated by reading a requirements document. A leader who treats them as order takers will never access this knowledge. They will receive exactly what they ask for, nothing more, and the products they ship will reflect the limits of their own understanding rather than the collective intelligence of the team.

The alternative is to treat every interaction with a technologist as an opportunity to iteratively extract intellectual property. Ask what they think. Ask why they disagree. Ask what they would build if they had the authority. The answers will be better than anything a steering committee can produce.

A Note on the Isolated AI Engineer

Question 25 is one of the most diagnostic questions in this assessment. The pattern it describes is common: an organisation hires a single AI or machine learning engineer, places them in a feature team composed entirely of people from different disciplines, and declares the AI capability embedded.

The candidate who sees nothing wrong with this structure reveals several dangerous blind spots simultaneously.

No quality gate. Machine learning work is unlike conventional software engineering. Model selection, feature engineering, training methodology, bias detection, and evaluation metrics require peer review from people who understand the discipline. An engineer whose work is reviewed only by people who cannot evaluate it is an engineer whose mistakes go undetected.

No career growth. Engineers grow by working alongside people who are better than them, or at least different enough to challenge their assumptions. A single AI engineer in a feature team has no mentor, no sparring partner, and no career path. They will plateau and leave, and the organisation will have to start again.

No management competence. If nobody in the management chain understands what the AI engineer does, nobody can set meaningful objectives, evaluate performance, identify when they are struggling, or advocate for the resources they need. The engineer is simultaneously unsupported and unaccountable.

No intellectual community. AI and machine learning are disciplines where techniques evolve rapidly. An isolated engineer has no internal community of practice, no one to discuss new approaches with, and no one to challenge their methodology. They become a single point of knowledge failure.

The green answer recognises that specialist disciplines need communities of practice. This does not necessarily mean a separate AI team, but it does mean deliberate structures that connect specialists, provide peer review, enable career progression, and ensure management understands the work well enough to support it.

The red answers treat the AI engineer as a fungible delivery resource whose value is measured by output against a timeline, which is the same mistake that drives experienced engineers out of organisations that claim they cannot find talent.

Final Thought

This assessment is not a test of intelligence. It is a test of instinct. Intelligent people can hold damaging instincts. The business head who optimises for reporting, timelines, and compliant teams is not stupid. They are applying a mental model that works in other domains but fails catastrophically in technology.

The purpose of this assessment is to find out which mental model the candidate carries before they are given the keys to a technology organisation and the careers of the people inside it.

Score Sheet

| Field | Value |
|---|---|
| Candidate Name | |
| Assessment Date | |
| Assessor | |

| Score | Count |
|---|---|
| Green | /40 |
| Amber | /40 |
| Red | /40 |

| Category | Green | Amber | Red |
|---|---|---|---|
| Leadership (Q1 to Q4) | /4 | /4 | /4 |
| Architecture (Q5 to Q7) | /3 | /3 | /3 |
| Delivery (Q8 to Q10) | /3 | /3 | /3 |
| Technical (Q11 to Q13) | /3 | /3 | /3 |
| Cloud (Q14 to Q17) | /4 | /4 | /4 |
| Testing (Q18 to Q20) | /3 | /3 | /3 |
| Data and AI (Q21 to Q26) | /6 | /6 | /6 |
| People (Q27 to Q29) | /3 | /3 | /3 |
| Quality (Q30 to Q32) | /3 | /3 | /3 |
| Reporting and Planning (Q33 to Q36) | /4 | /4 | /4 |
| Technologist Relationship (Q37 to Q40) | /4 | /4 | /4 |

| Red Flag Pattern | Detected |
|---|---|
| Timeline Addict | |
| Dashboard Governor | |
| Order Taker Factory | |
| Revenue Lens | |
| Process Worshipper | |
| AI Tourist | |

| Overall Assessment | |
|---|---|
| Recommendation | |
| Notes | |


Inspired by Why Andrew Baker Is the World’s Worst CTO

Automatically Recovering a Failed WordPress Instance on AWS

When WordPress goes down on your AWS instance, waiting for manual intervention means downtime and lost revenue. Here are two robust approaches to automatically detect and recover from WordPress failures.

Approach 1: Lambda Based Intelligent Recovery

This approach tries the least disruptive fix first (restarting services) before escalating to a full instance reboot.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# -k tolerates the self signed certificate a localhost HTTPS check usually hits
response=$(curl -s -k -o /dev/null -w "%{http_code}" https://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Install CloudWatch Agent on Your EC2 Instance

Still on your EC2 instance, download and install the CloudWatch agent:

wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb

Step 3: Create Metric Publishing Script on Your EC2 Instance

This script will send the health check result to CloudWatch every minute:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value "$HEALTH" \
  --dimensions Instance="$INSTANCE_ID" \
  --region "$REGION"
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it:

/usr/local/bin/send-wordpress-metric.sh

If you get permission errors, ensure your EC2 instance has an IAM role with CloudWatch permissions.
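If you need to create that policy, a minimal inline statement (an illustrative sketch; tighten to your environment) only needs the metric publishing action. Note that `cloudwatch:PutMetricData` does not support resource-level scoping, so the resource is `*`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}
```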

Step 4: Add Health Check to Cron on Your EC2 Instance

This runs the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 5: Create IAM Role for Lambda on Your Laptop

Now switch to your laptop (or use AWS CloudShell in your browser). You’ll need the AWS CLI installed and configured with credentials.

Create the IAM role that Lambda will use:

aws iam create-role \
  --role-name WordPressRecoveryRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

Attach the necessary policies:

aws iam attach-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

aws iam put-role-policy \
  --role-name WordPressRecoveryRole \
  --policy-name EC2SSMAccess \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "ec2:RebootInstances",
          "ec2:DescribeInstances",
          "ssm:SendCommand",
          "ssm:GetCommandInvocation"
        ],
        "Resource": "*"
      }
    ]
  }'

Step 6: Create Lambda Function on Your Laptop

On your laptop, create a file called wordpress-recovery.py in a new directory:

import boto3
import os
import time

ec2 = boto3.client('ec2')
ssm = boto3.client('ssm')

def lambda_handler(event, context):
    instance_id = os.environ.get('INSTANCE_ID')
    
    if not instance_id:
        return {'statusCode': 400, 'body': 'INSTANCE_ID not configured'}
    
    print(f"WordPress health check failed for {instance_id}")
    
    # Step 1: Try restarting services
    try:
        print("Attempting to restart services...")
        response = ssm.send_command(
            InstanceIds=[instance_id],
            DocumentName='AWS-RunShellScript',
            Parameters={
                'commands': [
                    'systemctl restart php-fpm || systemctl restart php8.2-fpm || systemctl restart php8.1-fpm',
                    'systemctl restart nginx || systemctl restart apache2',
                    'sleep 30',
                    'curl -f -k https://localhost || exit 1'
                ]
            },
            TimeoutSeconds=120
        )
        
        command_id = response['Command']['CommandId']
        print(f"Command ID: {command_id}")
        
        # Wait for command to complete
        time.sleep(35)
        
        result = ssm.get_command_invocation(
            CommandId=command_id,
            InstanceId=instance_id
        )
        
        if result['Status'] == 'Success':
            print("Services restarted successfully")
            return {'statusCode': 200, 'body': 'Services restarted successfully'}
        else:
            print(f"Service restart failed with status: {result['Status']}")
    
    except Exception as e:
        print(f"Service restart failed with error: {str(e)}")
    
    # Step 2: Reboot the instance as last resort
    try:
        print(f"Rebooting instance {instance_id}")
        ec2.reboot_instances(InstanceIds=[instance_id])
        return {'statusCode': 200, 'body': 'Instance rebooted'}
    except Exception as e:
        print(f"Reboot failed: {str(e)}")
        return {'statusCode': 500, 'body': f'Recovery failed: {str(e)}'}

Rename the script so the Lambda handler can import it (Python module names cannot contain hyphens), then create the deployment package:

mv wordpress-recovery.py wordpress_recovery.py
zip wordpress-recovery.zip wordpress_recovery.py

Get your AWS account ID:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

Deploy the Lambda function (replace i-1234567890abcdef0 with your actual instance ID and us-east-1 with your region):

aws lambda create-function \
  --function-name wordpress-recovery \
  --runtime python3.11 \
  --role arn:aws:iam::${AWS_ACCOUNT_ID}:role/WordPressRecoveryRole \
  --handler wordpress_recovery.lambda_handler \
  --zip-file fileb://wordpress-recovery.zip \
  --timeout 180 \
  --region us-east-1 \
  --environment "Variables={INSTANCE_ID=i-1234567890abcdef0}"

Step 7: Create CloudWatch Alarm on Your Laptop

Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching

This alarm triggers if the health check fails for 10 minutes (2 periods of 5 minutes each).
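The evaluation logic can be sketched in a few lines. This is an illustrative model of the alarm semantics, not CloudWatch's implementation: each 300 second period is averaged, and the alarm fires only when the last 2 period averages all breach the threshold:

```python
def alarm_fires(period_averages, threshold=1, evaluation_periods=2):
    """Model of the alarm: True when the most recent `evaluation_periods`
    period averages are all below `threshold`."""
    recent = period_averages[-evaluation_periods:]
    return len(recent) == evaluation_periods and all(avg < threshold for avg in recent)

# Healthy: every minute reports 1, so each 5 minute average is 1.0 -> no alarm
print(alarm_fires([1.0, 1.0]))       # False

# WordPress down for 10 minutes: two consecutive averages of 0.0 -> alarm
print(alarm_fires([1.0, 0.0, 0.0]))  # True

# A single bad period is not enough, which filters out transient blips
print(alarm_fires([1.0, 0.0]))       # False
```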

Step 8: Connect Alarm to Lambda on Your Laptop

Create an SNS topic (replace us-east-1 with your region):

aws sns create-topic --name wordpress-recovery-topic --region us-east-1

Get the topic ARN:

export TOPIC_ARN=$(aws sns list-topics --region us-east-1 --query 'Topics[?contains(TopicArn, `wordpress-recovery-topic`)].TopicArn' --output text)

Subscribe Lambda to the topic:

aws sns subscribe \
  --region us-east-1 \
  --topic-arn ${TOPIC_ARN} \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:${AWS_ACCOUNT_ID}:function:wordpress-recovery

Give SNS permission to invoke Lambda:

aws lambda add-permission \
  --region us-east-1 \
  --function-name wordpress-recovery \
  --statement-id AllowSNSInvoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn ${TOPIC_ARN}

Update the CloudWatch alarm to notify SNS (replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region):

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-down-recovery \
  --alarm-description "Trigger recovery when WordPress is down" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions ${TOPIC_ARN}

Approach 2: Custom Health Check with CloudWatch Reboot

This approach is simpler than the Lambda version. It uses a custom CloudWatch metric based on checking your WordPress homepage, then automatically reboots when the check fails.

Step 1: Create the Health Check Script on Your EC2 Instance

SSH into your WordPress EC2 instance and create the health check script:

sudo tee /usr/local/bin/wordpress-health.sh > /dev/null << 'EOF'
#!/bin/bash
# -k tolerates the self signed certificate a localhost HTTPS check usually hits
response=$(curl -s -k -o /dev/null -w "%{http_code}" https://localhost)
if [ "$response" -eq 200 ]; then
  echo 1
else
  echo 0
fi
EOF

sudo chmod +x /usr/local/bin/wordpress-health.sh

Test it works:

/usr/local/bin/wordpress-health.sh

You should see 1 if WordPress is running.

Step 2: Create Metric Publishing Script on Your EC2 Instance

This script sends the health check result to CloudWatch:

sudo tee /usr/local/bin/send-wordpress-metric.sh > /dev/null << 'EOF'
#!/bin/bash
INSTANCE_ID=$(ec2-metadata --instance-id | cut -d " " -f 2)
REGION=$(ec2-metadata --availability-zone | cut -d " " -f 2 | sed 's/[a-z]$//')
HEALTH=$(/usr/local/bin/wordpress-health.sh)

aws cloudwatch put-metric-data \
  --namespace "WordPress" \
  --metric-name HealthCheck \
  --value "$HEALTH" \
  --dimensions Instance="$INSTANCE_ID" \
  --region "$REGION"
EOF

sudo chmod +x /usr/local/bin/send-wordpress-metric.sh

Test it (ensure your EC2 instance has an IAM role with CloudWatch permissions):

/usr/local/bin/send-wordpress-metric.sh

Step 3: Add Health Check to Cron on Your EC2 Instance

Run the health check every minute:

(crontab -l 2>/dev/null; echo "* * * * * /usr/local/bin/send-wordpress-metric.sh") | crontab -

Verify it was added:

crontab -l

Step 4: Create CloudWatch Alarm with Reboot Action on Your Laptop

Now from your laptop (or AWS CloudShell), create the alarm. Replace i-1234567890abcdef0 with your instance ID and us-east-1 with your region:

aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name wordpress-health-reboot \
  --alarm-description "Reboot instance when WordPress health check fails" \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:automate:us-east-1:ec2:reboot

This will reboot your instance if WordPress fails health checks for 10 minutes (2 periods of 5 minutes).

That’s it. The entire setup is contained in 4 steps, and there’s no Lambda function to maintain. When WordPress goes down, CloudWatch will automatically reboot your instance.

Which Approach Should You Use?

Use Lambda Recovery (Approach 1) if:

  • You want intelligent recovery that tries service restart before rebooting
  • You need visibility into what recovery actions are taken
  • You want to extend the logic later (notifications, multiple recovery steps, etc)
  • You have SSM agent installed on your instance

Use Custom Health Check Reboot (Approach 2) if:

  • You want a simple solution with minimal moving parts
  • A full reboot is acceptable for all WordPress failures
  • You don’t need to try service restarts before rebooting
  • You prefer fewer AWS services to maintain

The Lambda approach is more sophisticated and tries to minimise downtime by restarting services first. The custom health check reboot approach is simpler and requires no Lambda function, but always reboots the entire instance.

Testing Your Setup

For Lambda Approach

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Watch the Lambda logs from your laptop:

aws logs tail /aws/lambda/wordpress-recovery --follow --region us-east-1

After 10 minutes, you should see the Lambda function trigger and attempt to restart services.

For Custom Health Check Reboot

SSH into your instance and stop nginx:

sudo systemctl stop nginx

Check that the metric is being sent from your laptop:

aws cloudwatch get-metric-statistics \
  --region us-east-1 \
  --namespace WordPress \
  --metric-name HealthCheck \
  --dimensions Name=Instance,Value=i-1234567890abcdef0 \
  --start-time $(date -u -d '15 minutes ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average

You should see values of 0 appearing. After 10 minutes, your instance will automatically reboot.

Both approaches ensure your WordPress site recovers automatically without manual intervention.

TOGAF is to architecture what potatoes are for space travel

You can survive on it for a while. You definitely should not build a mission around it.

1. The analogy nobody asked for, but everyone deserves

Potatoes are incredible. They are calorie dense, resilient, cheap, and historically important. They are also completely useless for space travel. No propulsion, no navigation, no life support, no guidance system. You can eat a potato in space, but you cannot go to space with one.

TOGAF sits in the same category for enterprise architecture. It is nutritionally comforting to executives, historically significant, and endlessly referenced. But as an operating system for modern architecture, it provides no thrust, no trajectory, and no survivability once you leave the launch pad.

2. What TOGAF actually optimises for (and why that is the problem)

TOGAF does not optimise for outcomes. It optimises for process completion and artifact production.

It is exceptionally good at helping organisations answer questions like:

  • Have we completed the phase?
  • Is there a catalog for that?
  • Has the architecture been reviewed?
  • Is the target state documented?

It is almost completely silent on questions that actually matter when building modern systems:

  • How fast can we deploy safely?
  • What happens when this service fails at 02:00?
  • What is the blast radius of a bad release?
  • How do we rotate keys, certificates, and secrets without downtime?
  • How do we prevent a single compromised workload from pivoting across the estate?
  • How do we design for regulatory audits that happen after things go wrong, not before?

TOGAF assumes that architecture is something you design first and then implement. Modern systems prove, daily, that architecture emerges from feedback loops between design, deployment, runtime behaviour, and failure.

TOGAF has no opinion on runtime reality. No opinion on scale. No opinion on latency. No opinion on failure. That alone makes it largely pointless.

3. The ADM: an elegant spiral that never meets production

The Architecture Development Method is often defended as “iterative” and “flexible”. This is technically true in the same way that walking in circles counts as movement.

ADM cycles through vision, business architecture, information systems, technology, opportunities, migration, governance, and change. What it never forces you to do is bind architectural decisions to:

  • Deployment pipelines
  • Observability data
  • Incident postmortems
  • Cost curves
  • Security events
  • Regulatory findings

You can complete the ADM perfectly and still design a system that:

  • Requires weekend release windows
  • Cannot be partially rolled back
  • Fails open instead of failing safe
  • Has shared databases across critical domains
  • Exposes internal services directly to the internet
  • Has no credible disaster recovery story beyond “restore the backup”

That is not iteration. That is documentation orbiting reality.

4. Architecture by artifact is not architecture

TOGAF strongly implies that architecture quality increases as artifacts accumulate. Catalogs, matrices, diagrams, viewpoints, repositories. The organisation feels productive because things are being filled in.

Modern architecture quality increases when:

  • Latency is reduced
  • Failure domains are isolated
  • Dependencies are directional and enforced
  • Data ownership is explicit
  • Security boundaries are non negotiable
  • Change is cheap and reversible

None of these improve because a document exists. They improve because someone made a hard decision and encoded it into infrastructure, platforms, and guardrails.

Artifact driven architecture replaces decision making with description. Description does not prevent outages, fraud, or regulatory breaches. Decisions do.

5. TOGAF governance vs real architectural leverage

TOGAF governance is largely procedural. Reviews, compliance checks, architecture boards, and sign offs. This feels like control, but it is control over paperwork, not over system behaviour.

Real architectural leverage comes from a small number of enforced constraints:

  • No shared databases between domains
  • All services deploy independently
  • All external access terminates through managed gateways
  • Encryption everywhere, no exceptions
  • Secrets never live in code or config files
  • Production access is ephemeral and audited
  • Every system has a defined failure mode

TOGAF does not give you these rules. It gives you a language to debate them endlessly without ever enforcing them.
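What enforcement looks like in practice is a guardrail in the pipeline, not a review board. A minimal sketch (hypothetical; the plan structure is assumed from Terraform's `terraform show -json` output shape) that fails the build when a security group rule exposes SSH to the internet:

```python
def violations(resource_changes):
    """Return addresses of planned security group rules that allow
    0.0.0.0/0 to reach port 22. Illustrative guardrail, not a full checker."""
    bad = []
    for rc in resource_changes:
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_security_group_rule" \
                and after.get("from_port", -1) <= 22 <= after.get("to_port", -1) \
                and "0.0.0.0/0" in (after.get("cidr_blocks") or []):
            bad.append(rc.get("address"))
    return bad

# A plan containing one offending rule: the CI job would fail on this output
plan = [{
    "address": "aws_security_group_rule.ssh_open",
    "type": "aws_security_group_rule",
    "change": {"after": {"from_port": 22, "to_port": 22,
                         "cidr_blocks": ["0.0.0.0/0"]}},
}]
print(violations(plan))  # ['aws_security_group_rule.ssh_open']
```

A rule like this is debated once, encoded once, and enforced on every change thereafter, which is the opposite of re-litigating it at each architecture board.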

6. TOGAF certification vs AWS certification in a cloud banking context

This is where TOGAF truly collapses under scrutiny.

Imagine you are designing a cloud based banking app. Payments, savings, lending, regulatory reporting, fraud detection, and customer identity. You have two architects.

Architect A:

  • TOGAF certified
  • Deep knowledge of ADM phases
  • Can produce target state diagrams, capability maps, and principles
  • Strong in stakeholder alignment workshops

Architect B:

  • AWS Solutions Architect Professional
  • AWS Security Specialty
  • AWS Networking Specialty
  • AWS DevOps Professional

Now ask a very simple question. Which one can credibly design and defend the following decisions?

  • Multi account landing zone design with blast radius containment
  • Zero trust network segmentation using cloud native primitives
  • Identity design using federation, least privilege, and break glass access
  • Encryption strategy using managed keys, HSMs, rotation, and separation of duties
  • Secure API exposure using gateways, throttling, and mutual authentication
  • Data residency and regulatory isolation across regions
  • Resilience patterns using multi availability zone and multi region strategies
  • Cost controls using budgets, guardrails, and automated enforcement
  • Incident response integrated with logging, tracing, and alerting
  • CI CD pipelines with automated security, compliance checks, and rollback

A TOGAF certificate prepares you to talk about these topics. Four cloud certifications prepare you to actually design them, build them, and explain their tradeoffs under audit.

In a regulated cloud banking environment, theoretical alignment is worthless. Auditors, regulators, and attackers do not care about your architecture repository. They care about what happens when something fails.

7. What modern architects actually need to know

This is the part TOGAF never touches.

A modern architect must have deep, practical understanding of the primitives the system is built from, not just the boxes on a diagram.

That means understanding cloud primitives at a mechanical level: compute scheduling, storage durability models, network isolation, managed identity, key management, quotas, and failure semantics. Not at a marketing level. At a “what breaks first and why” level.

It means being fluent in infrastructure as code, typically Terraform, and understanding state management, drift, blast radius, module design, promotion across environments, and how mistakes propagate at scale.

It means real security knowledge, not principles. How IAM policies are evaluated, how privilege escalation actually happens, how network paths are exploited, how secrets leak, how attackers move laterally, and how controls fail under pressure.

It means understanding autoscaling algorithms: what metrics drive them, how warm up works, how feedback loops oscillate, how scaling interacts with caches, databases, and downstream dependencies, and how to stop scale from amplifying failure.

It means observability as a first class architectural concern: logs, metrics, traces, sampling, cardinality, alert fatigue, error budgets, and how to debug distributed systems when nothing is obviously broken.

It means durability and resilience: replication models, quorum writes, consistency tradeoffs, recovery point objectives, recovery time objectives, and the uncomfortable reality that backups are often useless when you actually need them.

It means asynchronous offloads everywhere they matter: queues, streams, event driven patterns, back pressure, retry semantics, idempotency, and eventual consistency instead of synchronous coupling.
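To make the idempotency point concrete, a minimal sketch (hypothetical names) of a consumer that stays safe under at-least-once delivery, where the same message can legitimately arrive twice:

```python
processed_ids = set()   # in production this would be a durable store, not memory
ledger = []             # stands in for the real side effect, e.g. posting a payment

def handle(message):
    """Apply the message's effect exactly once, however often it is redelivered."""
    if message["id"] in processed_ids:
        return "duplicate-skipped"
    ledger.append(message["amount"])
    processed_ids.add(message["id"])
    return "applied"

msg = {"id": "evt-42", "amount": 100}
print(handle(msg))   # applied
print(handle(msg))   # duplicate-skipped: the retry is harmless
print(sum(ledger))   # 100, not 200
```

Without the deduplication check, every broker retry or consumer rebalance risks double-applying the side effect, which is exactly the failure mode synchronous thinking never prepares a team for.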

And yes, it means Kafka or equivalent streaming platforms: partitioning, ordering guarantees, consumer groups, replay, schema evolution, exactly once semantics, and how misuse turns it into a distributed outage generator.

None of this fits neatly into a TOGAF phase. All of it determines whether your bank survives load, failure, fraud, and regulatory scrutiny.

8. Why TOGAF survives despite all of this

TOGAF survives because it is politically safe.

It does not force engineering change. It does not threaten existing delivery models. It does not require platforms, automation, or hard constraints. It can be rolled out without upsetting anyone who benefits from ambiguity.

It allows organisations to claim architectural maturity without confronting architectural debt. It creates the appearance of control while avoiding the discomfort of real decisions.

Like potatoes, it is easy to distribute, easy to consume, and difficult to kill.

9. What architecture actually is in 2026

Modern architecture is not a framework. It is a set of enforced constraints encoded into platforms.

It is the intentional shaping of decision space so teams can move fast without creating systemic risk. It is about reducing coupling, shrinking blast radius, and making failure survivable. It is about designing systems that assume humans will make mistakes and attackers will get in.

If your architecture cannot be inferred from:

  • How your systems deploy
  • How they scale
  • How they fail
  • How they recover
  • How access is controlled
  • How data is isolated
  • How incidents are handled

Then it is not architecture. It is comfort food.

And comfort food has never put a bank safely into the cloud.

Why Agile Was A Bad Idea And Keeps Getting Worse

Or: How We Turned Software Development Into Ticket Farming and Ceremonial Theatre

1. Introduction

Agile started as a rebellion against heavyweight process. It was meant to free teams from Gantt charts, upfront certainty theatre, and waterfall failure modes. Somewhere along the way, Agile became exactly what it claimed to replace: a sprawling, defensible process designed to protect organisations from accountability rather than deliver software.

Worse, every attempt to fix Agile has made it more complex, more rigid, and more ceremonial.

2. Agile’s Fatal Mutation: From Values to Frameworks

The Agile Manifesto was never a methodology. It was a set of values. But values do not sell consulting hours, certifications, or operating models. Frameworks do.

So Agile was industrialised.

We now have a flourishing ecosystem of Agile frameworks, each promising to scale agility while quietly suffocating it. SAFe is the most egregious example, but not the only one. These frameworks are so complex that they require diagrams that look like subway maps and multi day training courses just to explain the roles.

When a process designed to reduce complexity requires a full time role just to administer it, something has gone badly wrong.

Figure: framework proliferation map showing how Agile spawned more governance than it replaced.

3. The Absurdity of Sprints

Few terms expose Agile’s intellectual dishonesty better than sprint.

A sprint is supposedly about speed, adaptability, and urgency. Yet in Agile practice, it is a fixed two week time box, planned in advance, estimated upfront, and reviewed retrospectively. There is nothing sprint like about it.

Calling a two week planning cycle a sprint is like calling a commuter train a race car.

Agile claims to embrace change, yet its core execution model actively resists it. Once work is committed to a sprint, change becomes scope creep rather than reality. The language is agile; the behaviour is rigid.

Figure: the sprint paradox, fixed time boxes masquerading as agility.

4. SAFe™ and the Industrialisation of Complexity (2026 Reality Check)

If SAFe™ was already bloated, the 2026 updates pushed it into full blown institutional absurdity.

The framework did not simplify. It did not converge. It did not correct course. It expanded. More roles. More layers. More artefacts. More synchronisation points. Every release claims to “reduce cognitive load” while aggressively increasing it.

SAFe™ in 2026 is no longer a delivery framework. It is a consulting extraction model.

4.1 Complexity Is the Product

The defining feature of modern SAFe™ is not agility. It is deliberate complexity.

The framework is now:

  • Too large for leadership to understand
  • Too abstract for engineers to respect
  • Too entrenched to remove once adopted

This is not accidental design failure. This is commercial optimisation.

SAFe™ is engineered to require:

  • Continuous certification
  • Ongoing retraining
  • Specialist roles that only external consultants can interpret
  • Diagrammatic sprawl that requires facilitation just to explain

If a framework needs paid interpreters to function, it is not a framework. It is a revenue stream.

4.2 Predatory Economics and Executive Ignorance

The 2026 SAFe™ model preys on a structural weakness in large organisations: technical illiteracy at the top.

Executives who do not understand software delivery are uniquely vulnerable to frameworks that look sophisticated, sound authoritative, and promise control. SAFe™ exploits this perfectly.

It sells:

  • Alignment instead of speed
  • Governance instead of ownership
  • Artefacts instead of outcomes
  • Process instead of production

Large consultancies thrive here. They do not fix delivery. They prolong transformation. Every new SAFe™ revision conveniently creates new problems that only certified experts can solve.

This is not transformation. It is dependency creation.

4.3 Safety Theatre for Leadership

SAFe™ does not optimise for delivery. It optimises for defensibility.

When delivery fails, leaders can say:

  • We followed the framework
  • We invested in training
  • We implemented best practice
  • We had alignment

Responsibility dissolves into ceremony.

SAFe™ provides political cover. It allows leadership to appear decisive without being accountable. Failure becomes systemic, not personal. That is its real value proposition.

4.4 Role Inflation as a Symptom of Collapse

The 2026 updates doubled down on role inflation:

  • More architects to manage architectural drift
  • More product roles to manage backlog confusion
  • More portfolio layers to manage coordination failure
  • More councils to manage decision paralysis

Each new role exists to compensate for the damage caused by the previous role.

This is not scale. This is organisational recursion.

4.5 Why SAFe™ Cannot Be Fixed

SAFe™ cannot be simplified without destroying its economic model.

If it were reduced to:

  • Small autonomous domain teams
  • Clear end to end ownership
  • Direct paths to production
  • Continuous deployment

There would be nothing left to certify. Nothing left to consult. Nothing left to sell.

So complexity grows. Terminology mutates. Diagrams expand. Billable hours increase.

This is not a failure of SAFe™.

This is SAFe™ working exactly as designed.


SAFe complexity diagram illustrating role and process sprawl

5. Alignment Is a Poor Substitute for Velocity

Agile frameworks obsess over alignment. Align the teams. Align the backlogs. Align the ceremonies. Align the planning cycles.

Alignment feels productive, but it is not speed.

True velocity comes from segregation and autonomy, not synchronisation. Teams that own domains end to end move faster than teams that are perfectly aligned but constantly waiting on one another.

Alignment optimises for consensus. Autonomy optimises for outcomes.

In practice, Agile alignment produces shared delays, shared dependencies, and shared excuses. Velocity dies quietly while everyone agrees on why.

6. Agile as a Ticket Collection System

Modern Agile organisations are not delivery machines. They are ticket processing plants.

Engineers spend an extraordinary amount of time creating tickets, grooming tickets, estimating tickets, updating ticket status, and explaining why tickets moved or did not move.

This is administrative work wrapped in the language of delivery.

Burn down charts are the pinnacle of this illusion. They show activity, not value. They measure compliance with a plan, not impact in production. They exist to reassure stakeholders, not users.


The ticket lifecycle showing how work multiplies without increasing value.

7. Burn Down Charts Are a Waste of Time

Burn down charts answer exactly one unimportant question: are we progressing against the plan we made two weeks ago?

They tell you nothing about whether the software is useful, whether users are happier, whether the system is more stable, or whether deployment is easier or safer.

They are historical artefacts, not decision tools. By the time a burn down chart reveals a problem, it is already too late to matter.

8. Engineer the Path to Production, Not a Defensible Process

Agile made a critical mistake: it focused on process before engineering.

Real agility comes from automated testing, trunk based development, feature flags, observability, continuous integration, and continuous deployment.

You do not become agile by following a defensible process. You become agile by engineering a path to production that is boring, repeatable, and safe.

A release pipeline beats a retrospective every time.

9. Continuous Deployment Is What Agile Pretends to Be

If agility means responding quickly to change, then continuous deployment is agility in its purest form.

No sprints
No ceremonies
No artificial planning cycles

Just small changes, shipped frequently, with fast feedback.

Continuous deployment forces discipline where it matters: in code quality, test coverage, and system design. It removes the need for most Agile theatre because progress is visible in production, not on a board.
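The gating that makes this safe is usually a feature flag: code ships dark, then is enabled per cohort once it proves itself in production. A minimal sketch of the idea (the flag store, flag names, and percentages here are invented for illustration, not any specific product):

```python
# Minimal feature-flag sketch: deploy code continuously, enable it gradually.
# FeatureFlags, "new_pricing" and "new_checkout" are illustrative names.

class FeatureFlags:
    def __init__(self, enabled=None, rollout=None):
        self._enabled = set(enabled or [])   # flags switched fully on
        self._rollout = dict(rollout or {})  # flag -> percentage of users

    def is_on(self, flag, user_id=0):
        if flag in self._enabled:
            return True
        pct = self._rollout.get(flag, 0)
        return (hash(user_id) % 100) < pct   # stable per-user bucketing

flags = FeatureFlags(enabled=["new_pricing"], rollout={"new_checkout": 10})

def checkout(user_id):
    if flags.is_on("new_checkout", user_id):
        return "new checkout path"
    return "old checkout path"               # deployed, but dormant for 90%
```

The code for the new checkout is in production from day one; the sprint-shaped question "is it done?" is replaced by the operational question "what percentage is it safe to enable?".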

Sprints versus continuous deployment showing time boxed delivery versus continuous flow.

10. Domains Beat Ceremonies

The most effective organisations do not scale Agile. They decouple themselves.

They organise around business domains, not backlogs. Teams own problems end to end. Dependencies are minimised by design, not managed through meetings.

This reduces coordination overhead, alignment ceremonies, and cross team negotiation, while increasing accountability, speed, quality, and ownership.

No framework can substitute for this.

11. Conclusion: Agile Isn’t Dead, But It Should Be

Agile failed not because its original ideas were wrong, but because organisations turned values into process and flexibility into dogma.

What remains today is ceremony without speed, alignment without autonomy, measurement without meaning, and process without production.

Agile did not make organisations adaptive. It made them defensible.

Real agility lives in engineering, autonomy, and production reality. Everything else is theatre.

The Dishonest Process of Technology Planning

1. Estimation Fails Exactly Where It Is Demanded Most

Estimation is most aggressively demanded in workstreams with the highest discovery, the highest uncertainty, and the highest intellectual property density. This is not an accident. The more uncomfortable the terrain, the more organisations reach for the false comfort of numbers. In these environments, estimation is not just wrong, it is structurally impossible. You are being asked to predict learning that has not yet occurred, risks that have not yet surfaced, and constraints that do not yet exist. This is not planning. It is numerology.

High discovery work is, by definition, about finding the problem while solving it. High IP work is about creating something that did not exist before. Estimation assumes a known path. Discovery assumes there is no path. These two ideas are incompatible.

Person presenting technology roadmap on whiteboard to seated colleagues in meeting room

2. Chess Is the Simplest Proof That Estimation Is Nonsense

Try estimating how long a game of chess will take. You cannot. The number of possible games exceeds any tractable search space. Two players, same rules, same board, radically different outcomes every time. You can window the opening because it is memorised. You can vaguely reason about the endgame because the state space has collapsed. The middle game, where real thinking happens, is unknowable until it is played.

Planning a game of chess in advance takes longer than actually playing it. To plan properly, you would need to analyse millions of branches that will never occur. This is exactly what technology programmes do when they insist on detailed delivery plans upfront. Months are spent modelling futures that reality will immediately invalidate.

The more time you spend estimating, the less time you spend learning. Learning is the only thing that reduces uncertainty.

3. Windows, Not Dates. Risk, Not False Precision

Dates create the illusion of certainty. Windows acknowledge reality. In high discovery work, the only honest outputs are windows, complexity signals, and risk indicators. Anything else is theatre.

No estimates should exist until the work is at least thirty percent complete. Before that point, you do not understand the shape of the problem, the resistance in the system, or the real integration costs. Early estimates are not conservative. They are random. Worse, they anchor expectations that will later be enforced as if they were commitments.

A window communicates intent without lying. A risk indicator communicates maturity without false confidence. This is not weakness. It is professional integrity.

4. A Proper Plan Is an Oxymoron

There is no such thing as a proper plan in technology. All plans are improper. Some are merely less wrong than others. Technology shifts underneath you. Dependencies move. Assumptions expire. What was optimal yesterday becomes harmful tomorrow.

Plans are snapshots of ignorance taken at a moment in time. Treating them as commitments rather than hypotheses is how organisations accumulate failure. The correct posture is not adherence to plan, but continuous replanning based on what you have learned since the last decision.

If your plan cannot survive daily contact with reality, it is not a plan. It is a liability.

5. Technology Planning Is Organisational Self Harm

Heavy investment in technology planning is a form of self harm. It is indulgent, expensive, and emotionally motivated. Its primary purpose is not delivery, but the calming of executive nerves through the illusion of control.

Planning artefacts grow precisely when control is lowest and risk is highest. Roadmaps thicken. Gantt charts multiply. Governance forums expand. None of this reduces uncertainty. It simply diverts energy away from learning and into defending a narrative.

This is the lie at the heart of technology planning. Control is low. Risk is high. Pretending otherwise does not make it safer. It makes it slower and more fragile.

Accept your reality. Put your energy into conquering the truth, not defending a lie. Every hour spent polishing a plan that reality will invalidate is an hour stolen from building, testing, integrating, and learning. Planning feels productive. Learning actually is.

Technology planning meeting with scattered documents and confused stakeholders

6. Everyone Has a Plan Until Reality Hits

“Everyone has a plan until they get punched in the face.” — Mike Tyson. Technology workstreams deliver that punch early, repeatedly, and without mercy.

Technology workstreams are not a single surprise. They are a sustained confrontation with reality. Legacy systems hit first. Data quality follows. Performance collapses under real load. Security assumptions evaporate. Users behave nothing like your models. Every one of these moments is a correction. None of them appear on the plan.

This is why planning confidence collapses so quickly once real work begins. Technology does not negotiate. It does not respect roadmaps. It reveals itself incrementally and relentlessly, one constraint at a time. The job is not to defend the plan after reality intrudes. The job is to stay standing and adapt faster than the next constraint reveals itself.

7. Interdependencies Are the Real Enemy

Most delivery failure is not caused by individual team performance. It is caused by interdependencies between teams, systems, environments, and decision makers. Estimation does not solve this. It hides it.

The only real remedy for interdependencies is to break them. Mocks, stubs, contracts, simulators, and fake services exist so that teams can move independently while reality catches up later. Waiting for another team to be ready is not coordination. It is organisational paralysis.

If your critical path depends on another team, your plan is already broken. Break the dependency or accept the delay. There is no third option.
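Breaking a dependency usually means coding against an agreed contract and substituting a fake until the other team delivers. A sketch, with assumed names (PaymentGateway and its method are invented for illustration):

```python
# Sketch: depend on a contract, not on another team's timeline.
# PaymentGateway, FakeGateway and settle_order are illustrative names.
from abc import ABC, abstractmethod

class PaymentGateway(ABC):            # the agreed interface contract
    @abstractmethod
    def charge(self, account: str, cents: int) -> str: ...

class FakeGateway(PaymentGateway):    # stands in until the real service ships
    def __init__(self):
        self.calls = []               # record calls for later verification
    def charge(self, account, cents):
        self.calls.append((account, cents))
        return "AUTH-FAKE"            # canned, predictable response

def settle_order(gateway: PaymentGateway, account: str, cents: int) -> bool:
    return gateway.charge(account, cents).startswith("AUTH")

gw = FakeGateway()
ok = settle_order(gw, "acc-1", 499)   # our code path is testable today
```

When the real gateway arrives, it implements the same contract and the fake is swapped out; until then, nothing on your critical path is waiting.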

8. Chase a Path to Production Relentlessly

You must chase a path to production from day one. Avoid the big reveal. Big reveals are how trust dies. They create a long silence followed by a single high risk moment where reality finally gets a vote.

Technology must deliver production value early, even if that value is small, partial, or hidden behind flags. The goal is not feature completeness. The goal is proving that the system can breathe in production conditions. Latency, security, deployment friction, data quality, and operational pain surface only when real traffic exists.

Delivery anxiety is a real force. You can only hold back the dams for so long. If value does not flow early, pressure builds, shortcuts appear, and quality becomes negotiable. Early production exposure releases pressure safely and continuously.

9. Shipping Dates to Exco Is Choosing Vanity Over Your Team

When you ship a date to an exco in a high discovery, high IP environment, you are not being accountable. You are choosing vanity over your team. You are signalling confidence you do not possess in order to look in control.

Ask yourself what you are really expecting your team to do. Do you expect them to ship rubbish into production on that date to protect the narrative? Do you expect them to quietly disagree but say nothing, pretending they accepted your made up certainty? When the date slips, will you say something “unforeseen” happened?

Of course it was unforeseen. That is the nature of high IP work. Calling it unforeseen does not make it exceptional. It makes the original date dishonest.

Dates force teams into impossible ethical corners. Either degrade quality, lie about progress, or absorb blame for a fiction they did not create. All three outcomes burn trust. None of them improve delivery.

Do not burn trust by shipping a date. Instead, ship a risk pack.

A proper risk pack shows what you are in for. It shows that you understand the terrain, the uncertainty, and the commercial exposure. It shows a credible route to delivering production value early, not a promise of completeness later. It demonstrates that the work can be made commercially viable through staged value, controlled exposure, and fast learning.

What exco actually needs is confidence that you are focused on delivery, speed, quality, and risk, not that you can guess the future. Dates satisfy anxiety. Risk packs build trust.

10. No Estimates and the Discipline of Reality

Woody Zuill’s No Estimates work is often misunderstood as anti planning. It is not. It is anti fiction. The core idea is simple. Focus on delivering small, valuable, production ready slices and use actual throughput as your only credible signal.

When teams stop estimating and start finishing, predictability emerges as a side effect. Not because the future became knowable, but because feedback loops became short. Work items are refined until they are small enough to complete safely. Risk is exposed immediately, not deferred behind optimistic forecasts.

No Estimates is not about refusing to answer questions. It is about refusing to lie. When asked how long something will take, the honest answer in high discovery work is what we have learned so far, what remains uncertain, and what we will try next.
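Using actual throughput as the signal can be sketched as a small resampling simulation that produces a window rather than a date. The weekly throughput history below is invented for illustration:

```python
# Sketch: forecast a delivery *window* from observed weekly throughput,
# instead of estimating upfront. The sample history is invented.
import random

def forecast_window(history, remaining_items, runs=10_000, seed=42):
    """Resample past weekly throughput to simulate weeks to completion."""
    rng = random.Random(seed)
    weeks_needed = []
    for _ in range(runs):
        done, weeks = 0, 0
        while done < remaining_items:
            done += rng.choice(history)   # a randomly resampled past week
            weeks += 1
        weeks_needed.append(weeks)
    weeks_needed.sort()
    # Report a window (50th to 90th percentile), not a single date.
    return weeks_needed[runs // 2], weeks_needed[int(runs * 0.9)]

history = [3, 5, 2, 4, 6, 3]              # items finished in recent weeks
p50, p90 = forecast_window(history, remaining_items=20)
print(f"likely {p50} weeks, plan for up to {p90}")
```

The forecast improves automatically as the history grows, which is exactly the point: the signal comes from finished work, not from guesses.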

11. Technology Change Is War

All technology change is a war. There is always an opponent, even if you pretend there is not. Legacy systems resist you. Data surprises you. Performance collapses under load. Users behave in ways your models never predicted. Every move reveals a counter move.

War is painful. It is humbling. You are always wrong, just in different ways over time. The only winning strategy is speed, decisiveness, and daily engagement. Monthly steerco updates are irrelevant. By the time you present the slide, the battlefield has already shifted.

If you are not all in, every day, close to the work, give it to someone else to run. This is not a governance problem. It is a leadership problem.

12. Relentless Adaptation Beats Perfect Prediction

The strongest teams do not pretend they are right. They constantly declare what did not work and what they are going to change next. This is not failure. This is competence made visible.

Never give up quality to meet a date. Dates recover. Quality debt compounds. Once trust in the system is gone, no timeline will save you. The goal is not to look predictable. The goal is to be effective in an environment that refuses to be predictable.

Stop estimating the unknowable. Shorten the feedback loop. Break dependencies. Chase production early. Declare learning openly. Move, counter move, and stay in the fight.

That is the only plan that works.

The 10 Biggest Differences Between Windows Server and Linux for Enterprises

Enterprise server operating systems are not chosen because they are liked. They are chosen because they survive stress. At scale, an operating system stops being a piece of software and becomes an amplifier of either discipline or entropy. Every abstraction, compatibility promise, and hidden convenience eventually expresses itself under load, during failure, or in a security review that nobody budgeted for.

This is not a desktop comparison. This is about the ugly work at the backend of enterprise applications and systems – where uptime is contractual, security incidents are existential and reputational, and operational drag quietly compounds until the organisation slows without understanding why.


1. Philosophy: Who the Operating System Is Actually Built For

Windows was designed around people. Linux was designed around workloads.

That single distinction explains almost everything that follows. Windows prioritises interaction, compatibility, and continuity across decades of application assumptions. Linux prioritises explicit control, even when that control is sharp edged and unforgiving.

In an enterprise environment, friendliness is rarely free. Every convenience hides a decision that an operator did not explicitly make. Linux assumes competence and demands intent. Windows assumes ambiguity and tries to smooth it over. At scale, smoothing becomes interference.

2. Kernel Architecture: Determinism, Path Length, and Control

Linux uses a monolithic kernel with loadable modules, not because it is ideologically pure, but because it is fast, inspectable, and predictable. Critical subsystems such as scheduling, memory management, networking, and block IO live in kernel space and communicate with minimal indirection. When a packet arrives or a syscall executes, the path it takes through the system is short and largely knowable.

This matters because enterprise failures rarely come from obvious bottlenecks. They come from variance. When latency spikes, when throughput collapses, when jitter appears under sustained load, operators need to reason about cause and effect. Linux makes this possible because the kernel exposes its internals aggressively. Schedulers are tunable. Queues are visible. Locks are measurable. The system does very little “on your behalf” without telling you.

Windows uses a hybrid kernel architecture that blends monolithic and microkernel ideas. This enables flexibility, portability, and decades of backward compatibility. It also introduces more abstraction layers between hardware, kernel services, and user space. Under moderate load this works well. Under sustained load, it introduces variance that is hard to model and harder to eliminate.

The result is not lower average performance, but wider tail latency. In enterprise systems, tail latency is what breaks SLAs, overloads downstream systems, and triggers cascading failures. Linux kernels are routinely tuned for single purpose workloads precisely to collapse that variance. Windows kernels are generalised by design.
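The average-versus-tail distinction can be made concrete with a few lines of arithmetic. The latency samples below are invented for illustration; the point is that two services with identical means can differ by an order of magnitude at p99:

```python
# Sketch: same average latency, very different tails. Data is invented.
def percentile(samples, pct):
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

steady = [10] * 99 + [12]     # tuned, low-variance service (ms)
spiky  = [8] * 99 + [210]     # same mean, occasional long stall (ms)

mean_steady = sum(steady) / len(steady)   # 10.02 ms
mean_spiky  = sum(spiky) / len(spiky)     # 10.02 ms
p99_steady  = percentile(steady, 99)      # small
p99_spiky   = percentile(spiky, 99)       # an order of magnitude larger
```

A dashboard showing only averages would rate both services identical; the SLA, the downstream timeouts, and the retry storms all live in the tail.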


3. Memory Management: Explicit Scarcity Versus Deferred Reality

Linux treats memory as a scarce, contested resource that must be actively governed. Operators decide whether overcommit is allowed, how aggressively the page cache behaves, which workloads are protected, and which ones are expendable. NUMA placement, HugePages, and cgroup limits exist because memory pressure is expected, not exceptional.

When Linux runs out of memory, it makes a decision. That decision may be brutal, but it is explicit.

Windows abstracts memory pressure for as long as possible. Paging, trimming, and background heuristics attempt to preserve system responsiveness without surfacing the underlying scarcity. When pressure becomes unavoidable, intervention is often global rather than targeted. In dense enterprise environments this leads to cascading degradation rather than isolated failure.

Linux enables intentional oversubscription as an engineering strategy. Windows often experiences accidental oversubscription as an operational surprise.
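That explicitness is visible in a single kernel switch, /proc/sys/vm/overcommit_memory, which has three documented modes. This sketch merely encodes their meaning for reference; it does not read or change any system setting:

```python
# The Linux overcommit policy is one explicit switch:
# /proc/sys/vm/overcommit_memory. This sketch encodes the three
# documented modes; it does not touch the system.
OVERCOMMIT_MODES = {
    0: "heuristic: refuse only obviously excessive allocations",
    1: "always: every allocation succeeds; the OOM killer settles it later",
    2: "never: enforce a commit limit via vm.overcommit_ratio or overcommit_kbytes",
}

def describe_overcommit(mode: int) -> str:
    if mode not in OVERCOMMIT_MODES:
        raise ValueError(f"unknown vm.overcommit_memory value: {mode}")
    return OVERCOMMIT_MODES[mode]
```

Whichever mode is chosen, the policy is a deliberate, inspectable decision made by an operator, not a heuristic buried in the platform.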

4. Restart Time and the Physics of Recovery

Linux assumes restarts are normal. As a result, they are fast. Kernel updates, configuration changes, and service restarts are treated as routine events. Reboots measured in seconds are common. Live patching reduces the need for them even further.

Windows treats restarts as significant milestones. Updates are bundled, sequenced, narrated, and frequently require multiple reboots. Maintenance windows expand not because the change is risky, but because the platform is slow to settle.

Mean time to recovery is a hard physical constraint. When a system takes ten minutes to come back instead of ten seconds, failure domains grow even if the original fault was small.

5. Bloat as Operational Debt, Not Disk Consumption

A Windows server often ships with a GUI, a browser, legacy subsystems, and optional features enabled by default. Each of these components must be patched, monitored, and defended whether they are used or not.

Linux distributions assume absence. You install what you need and nothing else. BusyBox demonstrates the extreme: one binary, dozens of capabilities, minimal surface area. This is not aesthetic minimalism. It is operational discipline.

Every unused component is latent liability. Linux is designed to minimise the number of things that exist.


6. Licensing Costs as a Systems Design Constraint

Linux licensing is deliberately dull. Costs scale predictably. Capacity planning is an engineering exercise, not a legal one.

Windows licensing scales with cores, editions, features, and access models. At small scale this is manageable. At large scale it starts influencing topology. Architects begin shaping systems around licensing thresholds rather than fault domains.

When licensing dictates architecture, reliability becomes secondary to compliance.

7. Networking, XDP, and eBPF: Policy at Line Rate

Linux treats the kernel as a programmable execution environment. With XDP and eBPF, packets can be inspected, redirected, or dropped before they meaningfully enter the networking stack. This allows DDoS mitigation, traffic shaping, observability, and enforcement at line rate.

This is not a performance optimisation. It is a relocation of control. Policy moves into the kernel. Infrastructure becomes introspective and reactive.

Windows networking is capable, but it does not expose equivalent in kernel programmability. As enterprises move toward zero trust, service meshes, and real time enforcement, Linux aligns naturally with those needs.


8. Containers as a Native Primitive, Not a Feature

Linux containers are not lightweight virtual machines. They are namespaces and control groups enforced by the kernel itself. This makes them predictable, cheap, and dense.

Windows containers exist, but they are heavier and less uniform. They rely on more layers and assumptions, which reduces density and increases operational variance.

Kubernetes did not emerge accidentally on Linux. It emerged because the primitives already existed.

9. Security Reality: Patch Gravity and Structural Exposure

Windows security is not weak because of negligence. It is fragile because of accumulated complexity.

A modern Windows enterprise stack requires constant patching across the operating system, the .NET runtime, PowerShell, IIS, legacy components kept alive for compatibility, and a long tail of bundled services that cannot easily be removed. Each layer brings its own CVEs, its own patch cadence, and its own regression risk. Patch cycles become continuous rather than episodic.

The .NET runtime is a prime example. It is powerful, expansive, and deeply embedded. It also requires frequent security updates that ripple through application stacks. Patching .NET is not a simple upgrade. It is a dependency exercise that demands testing across frameworks, libraries, and deployment pipelines.

Windows’ security model reflects its history as a general purpose platform. Backward compatibility is sacred. Legacy APIs persist. Optional components remain present even when unused. Security tooling becomes additive: agents layered on top of agents to compensate for surface area that cannot be removed.

Linux takes a subtractive approach. If a runtime is not installed, it cannot be exploited. Mandatory access controls such as SELinux and AppArmor constrain blast radius at the kernel level. Fewer components exist by default, which reduces the number of things that need constant attention.

Windows security is a campaign. Linux security is structural.

10. Stability as the Absence of Surprise

Linux systems often run for years not because they are neglected, but because updates rarely force disruption. Drivers, filesystems, and subsystems evolve quietly.

Windows stability has improved significantly, but its operational model still assumes periodic interruption. Reboots are expected. Downtime is normalised.

Enterprise stability is not about never failing. It is about failing in ways that are predictable, bounded, and quickly reversible.

Final Thought: Invisibility Is the Goal

Windows integrates. Linux disappears.

Windows participates in the system. Linux becomes the substrate beneath it. In enterprise environments, invisibility is not a weakness. It is the highest compliment.

If your operating system demands attention in production, it is already costing you more than you think. Linux is designed to avoid being noticed. Windows is designed to be experienced.

At scale, that philosophical difference becomes destiny.

Stability: The Water of Life for Engineering

Why Do Companies Get Stability So Wrong?

Most companies do not fail because they cannot innovate. They fail because they misjudge stability.

Some organisations under invest. They chase features, growth, and deadlines while stability quietly drains away. Outages feel sudden. Incidents feel unfair. Leadership asks how this happened “out of nowhere”.

Other organisations over invest. They build process on process, reviews on reviews, controls on controls. Delivery slows to a crawl. Engineers disengage. The system becomes stable but irrelevant. Eventually the business collapses under its own weight. Both groups are wrong for the same reason.

They treat stability as a thing you can reason about intellectually instead of a resource that behaves physically. Most corporate conversations about stability sound like this:

  • “Are we stable enough?”
  • “Do we need more resilience?”
  • “Let’s prioritise reliability this quarter”
  • “Teams can work on stability when they think it’s needed”

These are the wrong questions. Stability is not binary. It is not something you have or do not have. It is something that is constantly leaking away.

Entropy never pauses.
Complexity always grows.
Dependencies always drift.

So the real question is not “how much stability do we want?” It is “how do humans reliably maintain something that is always degrading, even when it feels fine?”

To answer that, it helps to stop thinking like executives and start thinking like biology. And that brings us to a very simple walking experiment.

1. A Simple Walking Experiment

Imagine three groups of walkers.
All three walk at exactly 5 km per hour.
The terrain is the same.
The weather is the same.
The only difference is how they consume water.


This is not a story about hydration. It is a story about engineering stability.

Group 1: No Water

This group decides they will push through.
Water is optional. They feel strong. They feel fine.

No surprises: they fail after 3 hours.

Group 2: Unlimited Water

This group has all the water they could ever want. Drink whenever you feel like it. No limits. No rules.

This group lasts longer, but still fails after 6 hours.

Group 3: One Cup Every 15 Minutes

This group is forced to drink one cup of water every 15 minutes. Even if they are not thirsty. Even if they feel fine. Even if they think it is unnecessary.

They walk forever.

2. Who Wins and Why?

The obvious loser is Group 1. Deprivation always kills you quickly.

But the surprising failure is Group 2. Unlimited water feels like safety. It feels mature. It feels trusting. Yet it still fails. Why?

Because humans are terrible at sensing slow degradation. By the time thirst is obvious, damage is already done. By the time things feel unstable, they are likely already in a really bad place.

Group 3 wins not because they are smarter.
They win because they removed judgment from the system.

3. Stability Is Like Water

Stability in engineering behaves exactly like hydration. It is:

  • Always leaking away
  • Always trending down
  • Never something you “finish”

You do not reach a stable system and stop.
You only slow the rate at which entropy wins.

The moment you stop drinking, dehydration begins. The moment you stop investing in stability, decay begins. There is no neutral state.

4. Why Does “Do It When You Need It” Fail?

Many teams treat stability like Group 2 treats water.

“We can fix reliability whenever we want.”
“We have budget for it.”
“We will focus on it after this delivery.”
“We are stable enough right now.”

This is a lie we tell ourselves because:

  • Instability accumulates silently
  • Risk compounds invisibly
  • Pain arrives late and all at once

Your appetite for stability is not accurate.
Your perception lags reality. By the time engineers feel the pain:

  • Pager load is already high
  • Cognitive load is already maxed
  • Trust in the system is already gone

5. Why Forced, Small, Regular Work Wins

Group 3 survives because the rule is boring, repetitive, and non negotiable.

One cup.
Every 15 minutes.
No debate.

Engineering stability works the same way.

Small actions:

  • Reviewing error budgets
  • Paying down tiny bits of tech debt
  • Exercising failovers
  • Reading logs when nothing is broken
  • Testing restores even when backups “worked last time”

These actions feel unnecessary right up until they are existential.
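Reviewing an error budget, for example, is simple arithmetic that can be made routine. This sketch assumes a 99.9% monthly availability SLO; the numbers are illustrative:

```python
# Sketch of a routine error-budget review: how much unreliability is
# left this month? A 99.9% SLO over 30 days is the assumed target.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Total allowed downtime, in minutes, for the period."""
    return days * 24 * 60 * (1 - slo)

def budget_remaining(slo: float, downtime_minutes: float, days: int = 30) -> float:
    return error_budget_minutes(slo, days) - downtime_minutes

budget = error_budget_minutes(0.999)                  # ~43.2 minutes a month
left = budget_remaining(0.999, downtime_minutes=12.0) # ~31.2 minutes left
```

The value is not the arithmetic. It is that the review happens on a fixed cadence, like the cup of water, regardless of whether anyone feels thirsty.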

The key insight is this:

Stability must be regular, small, and forced, not discretionary.

6. Carte Blanche Stability Still Fails

Giving teams unlimited freedom to “do stability whenever they want” feels empowering. It is not. It creates:

  • Deferral
  • Rationalisation
  • Optimism bias
  • Hero culture

Just like unlimited water, people will drink:

  • Too late
  • Too little
  • Only when discomfort appears

And discomfort always appears after damage.

7. Stability Is Not a Project

You do not “do stability”. You consume it continuously. Miss a few intervals and you do not notice. Miss enough and you collapse suddenly. This is why outages feel unfair. “This came out of nowhere.” It never did. You authored it when you made stability a choice.

8. The Temporary Uplift of New Leadership and Why It Fades

There is a familiar pattern in many organisations.

New leadership arrives.
Energy lifts.
Standards tighten.
Questions get sharper.
Long-ignored issues suddenly move.

For a while, stability improves.

This uplift is real, but it is also temporary.

Why?

Because much of the early improvement does not come from structural change.
It comes from attention.

People prepare more.
Risks are surfaced that were previously hidden.
Teams clean things up because someone is finally looking.

But attention is not a system. It does not scale. And it does not last. Over time, leaders get pulled upward and outward:

  • Strategy
  • Budgets
  • Politics
  • External pressure

The deep, uncomfortable details fade from view again. Entropy resumes its work. Eventually the organisation concludes it needs:

  • A new leader
  • A new structure
  • Another reset

And the cycle repeats.

8.1 Inspection Is Not Optional

John Maxwell captured this simply:

“What you do not inspect, you cannot expect.”

Stability is not maintained by policy. It is maintained by inspection. Leaders cannot delegate this entirely.

Dashboards help, but they are abstractions.
Audits help, but they are compliance-driven.
Neither replaces technical curiosity.

8.2 Why Audits Miss the Real Risks

Auditors are necessary, but they are constrained:

  • They work to checklists
  • They assess evidence, not behaviour
  • They validate controls, not fragility

They rarely ask:

  • What happens under load?
  • What breaks first?
  • What do engineers silently work around?
  • Where are we “hoping” things hold?

A technically competent leader, even without writing code daily, will notice:

  • Architectural smells
  • Operational anti-patterns
  • Client complaints
  • Excessive handoffs during fault resolution
  • Risk concentration
  • Overly large blast radii
  • “Accepted” risks no one remembers accepting

These things do not show up in audit findings.
They show up in deep dives.

8.3 Leadership Must Periodically Go to the Gemba

If leaders want stability to persist beyond their honeymoon period, they must:

  • Periodically deep dive the estate
  • Sit with engineers in the details
  • Review real incidents, not summaries
  • Ask uncomfortable “what if” questions

Not continuously. But deliberately. And repeatedly. This does two things:

  • It resets attention on the highest risks
  • It reinforces that stability is not someone else’s job

8.4 Sustainable Stability Outlives Leaders

The goal is not to rely on heroic leaders. The goal is to build systems where:

  • Risk surfaces automatically
  • Attention is forced by mechanisms
  • Leaders amplify the system instead of substituting for it

New leadership should improve things.
But stability should not depend on leadership churn. When stability only improves after a reset at the top, it is already leaking. The strongest organisations use leadership attention to reinforce cadence, not replace it.

9. The Engineering Lesson

Great engineering organisations do not trust feelings. They trust cadence. They bake stability into time:

  • Weekly reliability work
  • Fixed chaos testing intervals
  • Mandatory post-incident learning
  • Forced operational hygiene

Even when everything looks fine. Especially when everything looks fine. Because that is when dehydration is already happening.

10. Conclusion: Turning Stability from Belief into Mechanism

Stability does not survive on intent.
It survives on structure.

Most organisations say the right things about reliability, resilience, and operational excellence. Very few hard code those beliefs into how work actually gets done.

If stability depends on motivation, maturity, or “good engineering culture”, it will decay.
Those things fluctuate. Entropy does not.

The only way stability survives at scale is when it is embedded as a forced, recurring behaviour.

10.1 Make Stability Time Non-Negotiable

The first rule is simple: stability must have reserved time.

Set aside a fixed day each week, or a fixed percentage of capacity, that is explicitly not for delivery:

  • Automation
  • Observability improvements
  • Reducing operational toil
  • Fixing recurring incidents
  • Removing fragile dependencies

This time should not be borrowable.
It should not be traded for deadlines.
If it disappears under pressure, it was never real to begin with.

Just like forced hydration, the value is not in intensity.
It is in cadence.

10.2 Always Run a Short Cycle Risk Rewrite Program

High risk systems should never wait for a “big modernisation”.

Instead, always run a rolling program that:

  • Identifies the highest risk systems
  • Rewrites or refactors them in small, contained slices
  • Finishes something every cycle

This creates two critical properties:

  • Risk is continuously reduced, not deferred
  • Engineers stay close to production reality

Long-lived, untouched systems are where entropy concentrates.
Short cycles keep decay visible.

10.3 Encode Stability as Hard Parameters

The most important shift is this:
stop debating risk and start flushing it out mechanically.

Introduce explicit constraints that surface outsized risk early, for example:

  • Maximum database size: 10 TB
  • Maximum service restart time: 10 minutes
  • Maximum patch age: 3 months
  • Maximum server size: 64 CPUs
  • Maximum operating system age: 5 years
  • Maximum sustained IOPS: 60k
  • Maximum acceptable outage per incident: 30 minutes

These numbers do not need to be perfect.
They need to exist.

When a system crosses one of these thresholds, it triggers a conversation. Not a blame exercise. A prioritisation discussion.

The goal is not to prevent exceptions. The goal is to make embedded, accepted risk visible.
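Flushing risk out mechanically can be as plain as a lookup table and a comparison. A minimal sketch in Python using the example limits above; the metric names and figures are illustrative, not a recommended set:

```python
# Hard limits from the article. Each system reports its current metrics,
# and any breach triggers a prioritisation conversation, not a blame exercise.
LIMITS = {
    "db_size_tb": 10,         # maximum database size
    "restart_minutes": 10,    # maximum service restart time
    "patch_age_months": 3,    # maximum patch age
    "server_cpus": 64,        # maximum server size
    "os_age_years": 5,        # maximum operating system age
    "sustained_iops": 60_000, # maximum sustained IOPS
    "outage_minutes": 30,     # maximum acceptable outage per incident
}

def breached_limits(metrics: dict) -> list[str]:
    """Return the name of every limit this system has crossed."""
    return [name for name, limit in LIMITS.items()
            if metrics.get(name, 0) > limit]

# Example: a system with an oversized database and stale patches.
system = {"db_size_tb": 14, "patch_age_months": 5, "server_cpus": 32}
print(breached_limits(system))  # ['db_size_tb', 'patch_age_months']
```

When `breached_limits` returns a non-empty list, that is the signal for an early discussion, not an alert to be silenced.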

10.4 Adjust the Numbers, Never the Principle

Over time, these parameters will change:

  • Hardware improves
  • Tooling matures
  • Teams get stronger

That is fine.

What must never change is the mechanism:

  • Explicit limits
  • Automatic signalling
  • Early discussion
  • Intentional action

This is how you prevent stability debt from silently compounding.

10.5 Stability Wins When It Is Boring

The organisations that endure do not heroically fix stability problems in crises.
They routinely prevent them in boring ways.

Small actions.
Forced cadence.
Hard limits.

That is how Group 3 walks forever.

Stability is not something you believe in. It is something you operationalise. And if you do not embed it mechanically, entropy will do the embedding for you.

The Famine of Wisdom in the Age of Data Gluttony

Why More Information Doesn’t Mean More Understanding

We’ve all heard the mantra: data is the new oil. It’s become the rallying cry of digital transformation programmes, investor pitches, and boardroom strategy sessions. But here’s what nobody mentions when they trot out that tired metaphor: oil stinks. It’s toxic. It’s extraordinarily difficult to extract. It requires massive infrastructure, specialised expertise, and relentless refinement before it becomes anything remotely useful. And even then, used carelessly, it poisons everything it touches.

The comparison is more apt than the evangelists realise.

1. The Great Deception

Somewhere along the way, we convinced ourselves that accumulating information was synonymous with gaining understanding. That if we could just capture enough data points, build enough dashboards, and train enough models, clarity would emerge from the chaos. This is perhaps the most dangerous illusion of the modern enterprise.

I’ve watched organisations drown in their own data lakes, though calling them lakes is generous. Most are swamps. Murky, poorly mapped, filled with debris from abandoned projects and undocumented schema changes. Petabytes of customer interactions, transaction logs, sensor readings, and behavioural metrics, all meticulously captured, haphazardly catalogued, and largely ignored. The dashboards multiply. The reports proliferate. And yet the fundamental questions remain unanswered: What should we do? Why are we doing it? What does success actually look like?

Information is not knowledge. Knowledge is not wisdom. And wisdom is not guaranteed by any quantity of the preceding.

2. The Refinement Problem

Crude oil, freshly extracted, is nearly useless. It must be transported, heated, distilled, treated, and transformed through dozens of processes before it becomes the fuel that powers anything. Each step requires expertise, infrastructure, and enormous capital investment. Skip any step, and you’re left with toxic sludge.

Data follows the same brutal economics. Raw data is not an asset. It’s a liability. It costs money to store, creates security and privacy risks, and generates precisely zero value until someone with genuine expertise transforms it into something actionable. Yet organisations hoard data like digital dragons sitting on mountains of gold, convinced that possession equals wealth.

The transformation from data to wisdom requires multiple refinement stages: Data must become information through structure and context. Information must become knowledge through analysis and interpretation. Knowledge must become wisdom through experience, judgement and, critically, self-awareness. Each transition demands different skills, different tools, and different kinds of thinking. Most organisations have invested heavily in the first transition and almost nothing in the rest.

3. Tortured Data Will Confess Anything

There’s an old saying among statisticians: torture the data long enough and it will confess to anything. This isn’t a joke. It’s a warning that most organisations have failed to heed.

With enough variables, enough segmentation, and enough creative reframing, you can make data support almost any conclusion you’ve already decided upon. This is the dark side of sophisticated analytics: the tools that should illuminate truth become instruments of confirmation bias. The analyst who brings inconvenient findings gets asked to “look at it differently.” The dashboard that shows declining performance gets redesigned to highlight a more flattering metric. The model that contradicts the executive’s intuition gets retrained until it agrees.

If the data is telling you something that seems wrong, there are two possibilities. The first is that you’ve discovered a genuine insight that challenges your assumptions. This is rare and valuable. The second, far more common, is that something in your data pipeline is broken: bad joins, stale caches, misunderstood definitions, silent failures in upstream systems. Always validate. Always check your assumptions. And be deeply suspicious of any analysis that confirms exactly what you hoped it would.
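The confession is easy to reproduce. The sketch below, in Python, slices pure noise into enough segments that at least one looks like a discovery; every name and number is illustrative:

```python
import random

random.seed(42)

# A fair coin: the true "conversion rate" in every segment is exactly 50%.
# Slice the same noise into enough segments and some will look like insight.
SEGMENTS = 40
USERS_PER_SEGMENT = 100

rates = {
    f"segment_{i}": sum(random.random() < 0.5
                        for _ in range(USERS_PER_SEGMENT)) / USERS_PER_SEGMENT
    for i in range(SEGMENTS)
}

best = max(rates, key=rates.get)
# The "winning" segment sits well above 0.5 purely by chance. Present only
# that segment and you have tortured the data into a confession.
print(best, rates[best])
```

Nothing here is fraud in the criminal sense. It is just what happens when the number of comparisons is allowed to grow until one of them flatters the hypothesis.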

4. Embedded Lies

Here’s something that keeps me up at night: data doesn’t just contain errors. It contains embedded lies. Not malicious lies, necessarily, but structural deceits built into the very fabric of what we choose to measure and how we measure it.

Consider fraud in financial services. Industry estimates suggest that only around 8% of fraud is actually reported. That means any organisation fixating on reported fraud metrics is studying the tip of an iceberg while congratulating themselves on their visibility. The dashboards look impressive. The trend lines might even be heading in the right direction. But you’re optimising for a shadow of reality.

The organisation that achieves genuine wisdom doesn’t ask “how much fraud was reported last quarter?” It asks questions like: “Who else paid money into accounts we now know were fraudulent but never reported it? What patterns preceded the fraud we caught, and where else do those patterns appear? What are we not seeing, and why?”

These questions are harder. They require linking disparate data sources, challenging comfortable assumptions, and accepting that your metrics have been lying to you. Not because anyone intended deception, but because the data only ever captured what was convenient to capture. The fraud that gets reported is the fraud that was easy to detect. The fraud that doesn’t get reported is, almost by definition, the sophisticated fraud you should actually be worried about.
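The iceberg arithmetic is worth making explicit. A back-of-envelope sketch in Python; the reported figure is invented, and the 8% reporting rate is the industry estimate mentioned above:

```python
# If only ~8% of fraud is reported, the reported figure understates the
# real problem by roughly 12.5x. The reported amount below is made up.
reported_fraud = 4_000_000    # fraud reported last quarter (illustrative)
reporting_rate = 0.08         # industry estimate cited in the article

estimated_total = reported_fraud / reporting_rate
unreported = estimated_total - reported_fraud
print(f"estimated total: {estimated_total:,.0f}")  # ~50,000,000
print(f"never reported:  {unreported:,.0f}")       # ~46,000,000
```

A dashboard proudly trending the first number downwards says almost nothing about the second, which is the one that matters.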

5. The Illusion of Knowing Ourselves

Here’s where it gets uncomfortable. The data obsession isn’t just an organisational failure. It’s a mirror reflecting a deeper human delusion. We believe we are rational agents making deliberate, informed decisions. Neuroscience and behavioural economics have spent decades demolishing this comfortable fiction.

We are pattern matching machines running on heuristics, rationalising decisions we’ve already made unconsciously. We seek information that confirms what we already believe. We mistake correlation for causation. We see patterns in noise and miss signals in data. We are spectacularly bad at understanding our own motivations, biases, and blind spots.

This matters because organisations are collections of humans, and they inherit all our cognitive limitations while adding a few of their own. When an executive demands “more data” before making a decision, they’re often not seeking understanding. They’re seeking comfort. The data becomes a security blanket, a way to defer responsibility, a defence against future criticism. “The data told us to do it.”

But the data never tells us to do anything. We tell ourselves stories about what the data means, filtered through our assumptions, our incentives, and our fears. Without self-knowledge, without understanding our own biases and limitations, more data simply gives us more raw material for self-deception.

6. The Famine Amidst Plenty

We are living through a peculiar paradox: a famine of wisdom amidst a gluttony of data. We have more information than any civilisation in history and arguably less capacity to make sense of it. The problem isn’t access. It’s digestion.

Consider how we’ve changed the way we consume information. Twenty years ago, reading a book or a long-form article was normal. Today, we scroll through endless feeds, consuming fragments, never staying with any idea long enough to truly understand it. We’ve optimised for breadth at the expense of depth, for novelty at the expense of comprehension, for reaction at the expense of reflection.

Organisations have mirrored this dysfunction. The average executive receives hundreds of emails daily, sits through back-to-back meetings, and is expected to make consequential decisions in the gaps between. They have access to real-time dashboards showing every conceivable metric, yet they lack the time and mental space to think deeply about any of them. The tyranny of the urgent crowds out the importance of the significant.

Wisdom requires time. It requires sitting with uncertainty. It requires the humility to admit what we don’t know and the patience to discover it properly. None of these things scale. None of them show up on a dashboard. None of them impress investors or boards.

7. What Organisations Should Actually Do

If data is indeed the new oil, then we need to think like refineries, not like hoarders. This means fundamental changes in how we approach information.

First, ruthlessly prioritise. Not all data deserves collection, storage, or analysis. The question isn’t “can we capture this?” but “does this help us make better decisions about things that actually matter?” Most organisations would benefit from capturing less data, not more, but capturing the right data with much greater intentionality.

Second, drain the swamp before building the lake. If you can’t trust your existing data, adding more won’t help. Invest in data quality, in clear ownership, in documentation that actually gets maintained. A small, clean, well-understood dataset is infinitely more valuable than a vast murky swamp where nobody knows what’s true.

Third, invest in the refinement stages. For every pound spent on data infrastructure, organisations should be spending at least as much on the human capabilities to interpret it: skilled analysts, yes, but also domain experts who understand context, and experienced leaders who can exercise judgement. The bottleneck is rarely data. It’s the capacity to transform data into actionable understanding.

Fourth, build validation into everything. Assume your data is lying to you until proven otherwise. Cross-reference. Sanity-check. Ask “what would have to be true for this number to be correct?” and then verify those preconditions. Create a culture where questioning data is rewarded, not punished.
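What “assume the data is lying” looks like in practice: encode the preconditions a metric depends on as explicit checks that run before the number ever reaches a dashboard. A minimal Python sketch; the field names and failure messages are illustrative:

```python
from datetime import date

def validate_daily_revenue(rows: list[dict], report_date: date) -> list[str]:
    """Return every violated precondition for a daily revenue figure."""
    problems = []
    if not rows:
        problems.append("no rows at all: upstream feed may have silently failed")
    if any(r["amount"] < 0 for r in rows):
        problems.append("negative amounts: refunds mixed into gross revenue?")
    if any(r["date"] != report_date for r in rows):
        problems.append("rows from the wrong day: stale cache or bad join?")
    seen = set()
    for r in rows:
        if r["txn_id"] in seen:
            problems.append("duplicate transaction ids: double counting")
            break
        seen.add(r["txn_id"])
    return problems

rows = [
    {"txn_id": 1, "amount": 120.0, "date": date(2025, 1, 6)},
    {"txn_id": 1, "amount": 120.0, "date": date(2025, 1, 6)},  # duplicate row
]
print(validate_daily_revenue(rows, date(2025, 1, 6)))
```

If the list comes back non-empty, the figure never ships. Silent upstream failures become loud, which is exactly the point.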

Fifth, ask the questions your data can’t answer. The most important insights often live in the gaps. What aren’t you measuring? What can’t you see? If only 8% of fraud is reported, what does the other 92% look like? These questions require imagination and domain expertise, not just better analytics.

Sixth, create space for reflection. Wisdom doesn’t emerge from real-time dashboards or daily standups. It emerges from stepping back, asking deeper questions, and allowing insights to crystallise over time. This is profoundly countercultural in most organisations, which reward visible activity over invisible thinking. But the most consequential decisions (strategy, culture, long-term investments) require exactly this kind of slow, deliberate cognition.

Seventh, institutionalise self-awareness. This might sound soft, but it’s absolutely critical. Decisions made from a place of self-knowledge, understanding why we want what we want, recognising our biases, acknowledging our blind spots, are categorically different from decisions made in ignorance of our own psychology. Build in mechanisms that surface assumptions, challenge groupthink, and create psychological safety for dissent.

Eighth, measure what matters. The easiest things to measure are rarely the most important. Clicks are easier to count than customer trust. Output is easier to measure than outcomes. Activity is easier to track than impact. The discipline of identifying what actually matters, and accepting that some of it may resist quantification, is essential to breaking free from data theatre.

8. Decisions From a Place of Knowing

The goal isn’t to reject data. That would be as foolish as rejecting evidence. The goal is to put data in its proper place: as one input among many, useful but not sufficient, informative but not determinative.

The best decisions I’ve witnessed, the ones that created genuine value, that navigated genuine uncertainty, that proved robust in the face of changing circumstances, didn’t come from better dashboards. They came from leaders who understood themselves well enough to know when they were rationalising versus reasoning, who had cultivated judgement through experience and reflection, and who treated data as a conversation partner rather than an oracle.

This kind of wisdom is slow to develop and impossible to automate. It requires exactly the kind of patient, deep work that our information-saturated environment makes increasingly difficult. But it remains the essential ingredient that separates organisations that thrive from those that merely survive.

9. Conclusion: From Gluttony to Nourishment

Data is indeed the new oil. Which means it’s messy, it’s dangerous, and in its raw form, it’s nearly useless. It stinks. It requires enormous effort to extract. It demands sophisticated infrastructure and genuine expertise to refine. And like oil, its careless use creates pollution: in this case, pollution of our decision-making, our organisations, and our understanding of ourselves.

The organisations that will win the next decade aren’t the ones with the biggest data lakes, or swamps. They’re not the ones with the fanciest analytics platforms or the most impressive dashboards. They’re the ones that recognise the difference between information and understanding, between metrics and meaning, between data and wisdom.

They’ll be the organisations that ask hard questions about what their data isn’t showing them. That validate relentlessly rather than trust blindly. That understand tortured data will confess to anything and refuse to torture it. That recognise the embedded lies in their measurements and actively hunt for what they’re missing.

Most importantly, they’ll be organisations led by people who know themselves. Who understand their own biases, who can distinguish between reasoning and rationalising, who have the humility to admit uncertainty and the patience to sit with it. Because in the end, the quality of our decisions cannot exceed the quality of our self-knowledge.

The famine won’t end by consuming more data. It will end when we learn to digest what we already have: slowly, carefully, wisely. When we stop mistaking the swamp for a lake, the noise for a signal, and the comfortable lie for the inconvenient truth.

The first step in that transformation is the hardest one of all: admitting that we don’t know nearly as much as we think we do. Not about our customers, not about our markets, and certainly not about ourselves.

The famine won’t end until we stop gorging and start digesting.