The Dark Factory: Why Most Teams Are Getting Slower While a Few Are Building Software Without Any Humans

Article Summary

  1. What it is: A growing divide separates a handful of AI-native teams that have fully automated software production from the majority of developers who are actually getting slower when using AI tools, despite believing the opposite.
  2. Why it matters: Most developers and managers are operating on false assumptions about AI's productivity impact, which means teams are making costly strategic decisions (adopting tools, restructuring workflows, writing press releases) based on self-reported gains that controlled studies show are often imaginary.
  3. Key takeaway: The gap between developers who are genuinely faster with AI and those who are measurably slower comes down to which 'level' of AI delegation they've mastered, not whether they use AI at all.

What the gap between frontier AI teams and everyone else actually means for developers, managers, and organisations in 2026

1. The Paradox Nobody Is Talking About Honestly

Ninety percent of Claude Code’s codebase was written by Claude Code itself. Boris Cherny, the engineer who leads the Claude Code project at Anthropic, has not personally written code in months. Anthropic’s leadership is now estimating that functionally one hundred percent of all code produced at the company is AI generated. OpenAI’s Codex 5.3 is the first frontier model that was instrumental in building itself — earlier versions analyzed training logs, flagged failing tests, and suggested fixes to their own training scripts. The feedback loop on AI assisted software development has closed.

At the same time, in the same industry, on the same planet, a rigorous 2025 randomized controlled trial by METR found that experienced open source developers using AI tools took nineteen percent longer to complete tasks than developers working without them [1]. The researchers recruited sixteen developers from large repositories averaging more than one million lines of code, controlled for task difficulty, developer experience, and tool familiarity, and none of it mattered. AI made even experienced developers measurably slower. And here is the part that should really unsettle you: before the study began, those same developers predicted AI would make them twenty four percent faster. After completing it, they estimated they had been twenty percent faster. They were wrong not just about the direction of the change but about its magnitude [1].

Three teams are running what are being called lights out software factories. The rest of the industry is getting measurably slower while convincing themselves and their leadership with press releases that they are speeding up. The distance between these two realities is the most important gap in technology right now, and almost nobody is talking honestly about what it takes to cross it. That said, the METR study itself acknowledges a sampling problem that undercuts the strength of its slowdown finding. By the time METR attempted a follow on study in late 2025, a significant share of developers refused to participate because they were unwilling to work without AI tools, even for pay [11]. The developers most convinced that AI accelerates them are now systematically absent from the data. METR’s own researchers believe the true productivity effect is probably higher than their original estimate captured, because the study is now missing the developers who gain most from the tools [11]. The nineteen percent slowdown is a real finding for a specific population doing a specific kind of task in early 2025. It is not a verdict on AI assisted development as a category.

2. The Five Levels of Vibe Coding

Dan Shapiro, CEO of Glowforge and Wharton Research Fellow, published a framework in January 2026 that maps where the industry actually stands [2]. He calls it the five levels of vibe coding, and the name is deliberately informal because the underlying reality is what matters. The framework is modeled on the automotive industry’s five levels of driving automation, because the human role recedes at each stage in a structurally similar way.

Level zero is spicy autocomplete. You type the code, the AI suggests the next line, you accept or reject. This is GitHub Copilot in its original format: a faster tab key. The human is really writing the software. The AI is just reducing the keystrokes.

Level one is the coding intern. You hand the AI a discrete, well scoped task and review everything that comes back. The AI handles the task. The human handles the architecture, the judgment, and the integration.

Level two is the junior developer. The AI handles multifile changes, navigates a codebase, understands dependencies, and builds features that span multiple modules. You are reviewing more complicated output, but you are still reading all the code. Shapiro estimates that ninety percent of developers who describe themselves as AI native are operating at this level [2].

Level three is where the relationship starts to flip: the developer is now the manager, directing the AI and reviewing output at the PR level rather than writing any code themselves. Almost everybody tops out here right now because of the psychological difficulty of letting go of the code [2].

Level four is the developer as product manager. You write a specification, leave, come back hours later, and check whether the tests pass.

Level five is the dark factory. A black box that turns specifications into software. No human writes the code. No human reviews the code. The factory runs autonomously with the lights off; the name comes from manufacturing facilities run entirely by robots, dark because there is no need for lights when nobody is inside [2].

The levels framework is clean and persuasive, but it may flatten a crucial distinction. Most software is not being written for a new product on a greenfield codebase. The overwhelming majority of professional coding involves modifying, debugging, and extending systems with years of accumulated context. A framework built around the experience of a small team building a new product from scratch tells you something real about the frontier, but it systematically overstates how close the rest of the industry is to those transitions. The developer who reaches level three on a fresh project and the developer who reaches level three inside a fifteen year old financial services codebase with undocumented dependencies are not in the same situation. The levels model also does not reckon with what gets lost when humans stop reading the code: not just bug catching, but accumulated understanding of the system, the institutional knowledge of why things were built the way they were. That loss may not show up in a satisfaction score.

3. Inside a Real Dark Factory: StrongDM’s Software Factory

StrongDM’s software factory is the most thoroughly documented example of level five software development in production. Simon Willison, one of the most careful and credible observers in the developer tooling space, visited the team in October 2025 and described it as the most ambitious form of AI assisted software development he had seen yet [4]. It is worth noting that StrongDM does not build internal tools or toy applications. It builds access management and security software for enterprises — infrastructure that controls who can touch what across Okta, Jira, Slack, Google Drive, and more.

The team is three people: Justin McCarthy, Jay Taylor, and Navan Chauhan. They founded the StrongDM AI team on July 14, 2025, and published their full methodology publicly in February 2026 [3]. The inflection point they identify is the second revision of Claude 3.5 Sonnet, which shipped in October 2024. That is when, in their words, long horizon agentic coding workflows began to compound correctness rather than error [3]. The factory runs on an open source coding agent called Attractor. The repository is just three markdown specification files. The two founding rules of the factory are that code must not be written by humans and code must not be reviewed by humans. The guiding question for every engineer on the team is a single prompt: why am I doing this? The implication is always the same — the model should be doing it instead [3].

Their approach to testing is where most people’s mental models start to break down. StrongDM does not use traditional software tests. They use scenarios: end to end user stories stored outside the codebase, invisible to the code generating agents, evaluated by an LLM acting as judge. The team discovered that agents given access to their own tests would write return true and declare victory [3]. Scenarios, stored as holdout sets in the machine learning sense, prevent this. Success is measured not as a boolean pass or fail but as a satisfaction rate: what fraction of observed user trajectories through all scenarios likely satisfied the user [3]? The other major architectural piece is the Digital Twin Universe: behavioral clones of every external service the software integrates with — full replicas of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets — running locally with no rate limits and no production risk [3]. The output is real. CXDB, their AI context store, has sixteen thousand lines of Rust, nine and a half thousand lines of Go, and six thousand seven hundred lines of TypeScript, shipped and in production [4].
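StrongDM has not published its evaluation harness, but the holdout scenario idea can be sketched in miniature. In the sketch below, `Scenario`, `run_trajectory`, and `judge` are all hypothetical stand-ins of my own naming: in practice the judge would be an LLM call, and the scoring rubric here is an assumption, not their published method.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    # An end to end user story, stored outside the codebase so the
    # code generating agents never see it (a holdout set).
    name: str
    user_story: str


def satisfaction_rate(
    scenarios: List[Scenario],
    run_trajectory: Callable[[Scenario], str],  # replays the story against the built software
    judge: Callable[[str, str], float],         # hypothetical LLM judge: (story, trajectory) -> score in [0, 1]
) -> float:
    """Fraction of observed user trajectories that likely satisfied the user.

    Unlike a boolean test suite, each scenario yields a graded score,
    and the factory tracks the aggregate rate rather than pass/fail.
    """
    if not scenarios:
        return 0.0
    scores = [judge(s.user_story, run_trajectory(s)) for s in scenarios]
    return sum(scores) / len(scores)
```

The design choice that matters is that both the scenarios and the judge live outside anything the coding agents can read, which is what blocks the return-true gaming the team observed.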

Several serious criticisms of the StrongDM model have emerged from practitioners who have studied it closely, and they deserve equal weight. The first is code quality beneath the surface. A technical review of CXDB’s Rust code noted heavy reliance on patterns that typically signal an agent fighting the language’s ownership model rather than designing around it [12]. The code passes the scenarios, but the internal architecture may be accumulating invisible debt that will only surface when performance degrades or a subtle concurrency bug hits production. The second is the cost of the Digital Twin Universe. Building faithful replicas of Slack, Okta, Jira, and Google’s suite is an enormous engineering investment. For three experienced engineers who know these APIs deeply, it was achievable. For a larger organisation with dozens of integrations and a rotating team, twin maintenance quickly becomes a new bottleneck rather than a solved problem [13]. The third is accountability. Stanford Law’s CodeX Center asked the question nobody in the hype cycle wants to confront: when no human has read the code, who is liable when it fails [10]? StrongDM builds security software. If an access management flaw leaks credentials because an agent introduced a subtle privilege escalation that no human ever saw, existing liability frameworks have no clear answer for who bears responsibility. The fourth is competitive moat erosion. Willison himself noted that if any competitor can clone your newest features with a few hours of coding agent work, the economics of the dark factory raise an uncomfortable question about what sustainable competitive advantage looks like when implementation becomes trivially reproducible [4].
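To make the twin idea, and the maintenance cost critique, concrete: a behavioral twin is a local stand-in that mimics an external API's observable behavior rather than proxying it. StrongDM has not released its twin code, so the following is a deliberately minimal toy covering a single Slack Web API endpoint, with no auth, channels, or failure modes; a faithful twin would need all of those, which is exactly where the maintenance burden comes from.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlackTwinHandler(BaseHTTPRequestHandler):
    """Toy behavioural twin of Slack's chat.postMessage endpoint.

    Runs locally with no rate limits and no production risk; agents
    under test talk to this instead of the real Slack API.
    """

    messages = []  # in-memory store, shared across requests

    def do_POST(self):
        if self.path != "/api/chat.postMessage":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        self.messages.append(body)
        # Mimic the shape of Slack's real response: ok flag, channel, ts.
        payload = json.dumps({
            "ok": True,
            "channel": body.get("channel"),
            "ts": f"{len(self.messages)}.000000",
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep output quiet
```

Every endpoint the software touches needs an equivalent stub, kept in sync with the real service's behavior; multiply that by dozens of integrations and the critique in [13] becomes tangible.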

4. The Self Referential Loop at the Frontier

Codex 5.3 is the first frontier model that was instrumental in creating itself. Earlier builds analyzed training logs, flagged failing tests, and suggested fixes to their own training scripts. This model shipped as a direct product of its own predecessors’ coding labour. OpenAI reported a twenty five percent speed improvement and ninety three percent fewer wasted tokens in building Codex 5.3, and those improvements came in part from the model identifying its own inefficiencies during its own construction. Claude Code is doing something similar. Ninety percent of the code in Claude Code was built by Claude Code itself. Four percent of all public commits on GitHub are now directly authored by Claude Code — a number Anthropic expects will exceed twenty percent by the end of 2026 [5]. Claude Code hit a billion dollar run rate just six months after launch.

The self referential framing is striking, but it conflates two meaningfully different things. A model contributing to its own training pipeline by flagging test failures is not the same as a model autonomously designing its own architecture. The improvements in Codex 5.3 came from predecessor models doing bounded, well defined tasks with humans making the architectural and strategic decisions throughout. The loop is real, but it is a supervised loop with humans firmly in control of what matters most. Describing it as AI building itself risks creating a mental image of recursive self improvement without limit, which is not what is actually happening. The more honest framing is that AI is becoming a useful contributor to the development of AI tooling, in the same way that any good programming tool eventually gets used to improve itself. That is genuinely significant. It is not the same as unbounded autonomous self improvement.

5. Why Most Teams Are Getting Slower

When you bolt an AI coding assistant onto an existing workflow without redesigning that workflow around the tool, productivity dips before it gets better. Developers spend time evaluating AI suggestions, correcting almost right code, context switching between their own mental model and the model’s output, and debugging subtle errors introduced by generated code that looked correct but was not. Google’s 2024 DORA report found that every twenty five percent increase in AI adoption correlated with a one point five percent dip in delivery speed and a seven point two percent drop in system stability [6]. One senior engineer put it sharply: Copilot makes writing code cheaper but owning it more expensive. The organisations seeing genuine productivity gains of twenty five percent or more are not the ones that installed Copilot and called it done. They are the ones that went back to the whiteboard and redesigned their entire development workflow around AI capabilities — changing how they write specs, how they review code, what they expect from junior versus senior engineers, and how their CI/CD pipelines catch the new category of errors that AI generated code introduces.

The J curve narrative is intuitively appealing but risks becoming unfalsifiable. Any organisation that installs AI tools and gets slower can be told it just has not yet redesigned its workflow sufficiently. Any organisation that gets faster can be cited as evidence the theory is right. There is a real possibility that the slowdown is not a transitional dip but a persistent feature of certain kinds of work — particularly the deep, context saturated work of maintaining large existing codebases — and that no amount of workflow redesign will change that. METR’s own analysis identified that developers who struggled most were working on mature codebases where they had five or more years of experience and deep implicit knowledge [1]. For those developers, the AI is not filling a gap. It is adding noise to a system that was already working well. The evidence that end to end process transformation reliably produces twenty five percent gains is largely anecdotal and comes primarily from organisations with a vested interest in reporting positive results.

6. The Organisational Structures That No Longer Make Sense

Every process in a software organisation was designed to address a human limitation. Stand ups exist because developers need to synchronise. Sprint planning exists because humans can only hold a finite number of tasks in working memory. Code review exists because humans make mistakes that other humans can catch. When the human is no longer the one writing the code, these structures are not optional overhead. They are friction. StrongDM’s three person team does not have sprints. They do not have standups. They do not have a Jira board. They write specs and they evaluate outcomes. The entire coordination layer that most engineering managers spend sixty percent of their time maintaining does not exist — not because it was cut for cost, but because it no longer serves a purpose [3]. The skills that matter are shifting from coordination to articulation. From making sure people are rowing in the same direction to making sure the direction is described precisely enough that machines can execute it. The bottleneck has moved from implementation speed to specification quality.

The claim that coordination structures become pure friction once AI does the implementation assumes that specification writing is itself uncoordinated — that one person can write a spec in isolation and hand it to agents without the alignment work that standups and planning meetings currently provide. In practice, specifications for non trivial software encode dozens of implicit decisions about trade offs, priorities, customer needs, and technical constraints. Those decisions currently get resolved through conversation, through the back and forth of planning meetings, through a senior engineer asking whether you meant X or Y. When the agent cannot ask that question, the burden does not disappear. It migrates into the specification itself, which means someone has to do the alignment work before writing the spec. Whether that is faster or slower than the coordination overhead it replaces is an open empirical question that nobody has answered rigorously yet. The dark factory may simply relocate the coordination tax rather than eliminate it.

7. The Legacy System Problem

The vast majority of enterprise software is brownfield. Monoliths grown organically through fifteen years of feature additions. CI/CD pipelines tuned to the quirks of a specific codebase. Configuration that exists in the heads of the three people who have been at the company long enough to remember why that one environment variable is set to that particular value. You cannot dark factory your way through a legacy system. The specification for it does not exist. The tests, if there are any, cover thirty percent of the codebase, and the remaining seventy percent runs on institutional knowledge, tribal lore, and someone who shows up once a week and knows where all the skeletons are buried. The system itself is the specification. For most organisations, the path to the dark factory begins with developing a specification for what their existing software actually does, which is deeply human work requiring the engineer who knows why the billing module has that one edge case for Canadian customers, the architect who remembers which microservice was carved out of the monolith under duress during the 2021 outage, and the product person who can explain what the software actually does for real users versus what the PRD says it does.

This is the strongest check on the dark factory narrative, but it also risks being used as a reason to do nothing rather than a guide to where to start. The existence of legacy complexity does not mean that AI is useless on legacy systems. It means the entry point is different. There is already significant evidence that AI tools are highly effective at generating documentation from existing code, explaining unfamiliar modules, writing tests for untested functions, and identifying dead code. None of that requires a complete specification of the system. Teams that are waiting for perfect specifications before attempting AI adoption are likely waiting forever. The more productive framing is that AI can help you build the specification incrementally, module by module, as you deepen your understanding of a system — rather than treating the specification as a prerequisite that must fully exist before AI can add any value at all.
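That incremental approach can be sketched as a simple loop. Nothing below is a prescribed tool: `ask_llm` is a hypothetical stand-in for whatever model call a team actually uses, and the essential human step, reviewing and correcting each drafted summary, deliberately stays outside the function.

```python
from pathlib import Path
from typing import Callable


def build_spec_incrementally(
    repo_root: Path,
    ask_llm: Callable[[str], str],  # hypothetical LLM call; any provider works
    out_file: Path,
) -> int:
    """Draft a behavioural spec one module at a time from existing source.

    Returns the number of modules summarised. The output is a starting
    draft for human review, not a verdict on what the system does.
    """
    entries = []
    for module in sorted(repo_root.rglob("*.py")):
        source = module.read_text(encoding="utf-8")
        prompt = (
            "Summarise what this module actually does, including edge "
            f"cases and implicit assumptions, as a short behavioural spec:\n\n{source}"
        )
        entries.append(f"## {module.relative_to(repo_root)}\n{ask_llm(prompt)}\n")
    out_file.write_text("\n".join(entries), encoding="utf-8")
    return len(entries)
```

The point of the loop is the unit of work: one module at a time, each draft checked by the person who knows that module, so the specification grows alongside understanding instead of being a prerequisite for it.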

8. The Talent Reckoning

A 2025 Harvard study tracking sixty two million workers across two hundred and eighty five thousand US firms found that when companies begin using generative AI, junior employment drops by nine to ten percent within six quarters of implementation, while senior employment remains virtually unchanged [7]. In the UK, tech graduate roles fell forty six percent in 2024, with projections pointing to a further fifty three percent drop by 2026 [8]. In the US, junior developer job postings have declined by sixty seven percent [8]. The career ladder in software engineering has always functioned as an apprenticeship model. Juniors learn by doing: writing simple features, fixing small bugs, absorbing a codebase through immersion. AI breaks that model at the bottom. If AI handles the simple features and the small bug fixes, the career ladder gets hollowed out from underneath: seniors at the top, AI at the bottom, and a thinning middle where learning used to happen. And yet we need more excellent engineers than we have ever needed, not fewer. Adequate is no longer a viable career position regardless of seniority, because adequate is what the models do.

The talent reckoning narrative is real but potentially self defeating if it discourages exactly the investments that would help. The historical pattern from every previous labour market transition involving automation is that the jobs that disappear are not the only jobs that matter. The automobile industry eliminated farriers and stable hands and created mechanical engineers, road builders, urban planners, and logistics specialists at far larger scale. Former GitHub CEO Thomas Dohmke has argued that junior developers who are AI native from the start are not the victims of this transition but its likely beneficiaries, because they will arrive in the workforce with skills that senior engineers spent years trying to acquire [14]. That argument does not help the people currently frozen out of entry level jobs, but it does complicate the picture of a profession in straightforward decline. The more honest framing is that the transition is real and the pain is unevenly distributed — falling hardest on those who trained for a version of the profession that is contracting fastest — without that implying the profession itself is in terminal decline.

9. What AI Native Organisations Actually Look Like

Cursor has passed half a billion dollars in annual recurring revenue with a headcount in the low hundreds: roughly three and a half million dollars in revenue per employee, compared to an industry average of around six hundred thousand for SaaS companies. The top ten AI native startups are averaging around three million in revenue per employee, between five and six times the SaaS industry average. These are small groups of people who are exceptionally good at understanding what users need, at translating that understanding into clear specification, and at directing AI systems that handle the implementation. The org chart is flattening radically. The coordination layers that exist to manage hundreds of engineers can be deleted when the engineering is done by agents. The only people who remain are the ones whose judgment cannot be automated.

The revenue per employee numbers are real, but they require careful interpretation. Cursor and Midjourney are developer tools and creative platforms — software products with extraordinarily efficient distribution and near zero marginal cost of serving each additional user. They are not representative of the full software economy. A regional bank, a healthcare system, a government agency, or a manufacturing company cannot simply replicate the Cursor business model by adopting AI coding tools, because their value does not come from software distribution efficiency. It comes from regulated relationships, physical infrastructure, institutional trust, and operational complexity that AI cannot compress. The AI native startup numbers are an existence proof that extremely small teams can generate enormous revenue with AI, but extrapolating from that to a claim that all software organisations will look like this is a category error. The companies that will shrink most dramatically are those building commodity software. The ones operating in regulated, relationship dependent, or operationally complex domains will face a very different transition, with different timelines and different constraints.

10. The Demand Horizon

Every previous transition that dropped the cost of computing exploded total demand for software rather than flattening it. The cloud did not just make existing software cheaper. It created SaaS, mobile applications, streaming, real time analytics, and dozens of other categories that could not previously exist. The same dynamic applies now. A regional hospital, a mid market manufacturer, a family logistics company — organisations that genuinely need custom software but have never been able to afford it — are suddenly in the addressable market. The constraint moves from can we build it to should we build it. The dark factory does not replace the people who can answer that question well. It amplifies them. It turns a great product thinker with five engineers into a great product thinker with effectively unlimited engineering capacity.

The demand explosion argument is the standard rebuttal to automation concerns and it deserves genuine scrutiny. The historical pattern of technology creating more jobs than it destroys has held across transitions that took decades, during which labour markets had time to adjust, retraining could happen generationally, and the new categories of work were geographically co located with the old ones. The current transition is compressing those timescales dramatically. If AI capabilities advance as fast as the trajectory of the past two years suggests, the new demand for software in previously unaddressable markets may materialise faster than the workforce can retrain to serve it. The jobs created may require fundamentally different skills from the jobs destroyed, in different cities, accessible to different demographics. That the total eventually balances in aggregate is cold comfort to the specific person whose specific skills have been automated in the specific decade of their working life.

11. The Honest Summary

The dark factory is real. A small number of teams are producing production software without any humans writing or reviewing code. The tools are building themselves. Those teams are getting faster and faster. Most companies are not there. They are stuck at level two, getting measurably slower with AI tools they believe are making them faster, and running organisational structures designed for a world where humans do all the implementation work. The distance between them is not primarily a technology gap. It is a people gap, a culture gap, an organisational gap, and a willingness to change gap that no tool and no vendor can close.

But the counter argument that runs through every section of this article also needs to be said plainly. The dark factory narrative, told without qualification, risks replacing one form of hype with another. StrongDM is three unusually capable engineers who built a greenfield system in a domain they understood deeply, with a token budget most organisations cannot sustain, in a way that raises unresolved questions about code quality, accountability, and competitive durability. The METR slowdown study has real sampling limitations that its own authors acknowledge. The revenue per employee numbers come from a narrow class of software businesses. The demand explosion has historical precedent but no guarantee. The talent transition is real and painful and not automatically self correcting.

The most honest position is that both things are true simultaneously and uncomfortably. The dark factory is a genuine, working glimpse of where software development is going. And the distance between that glimpse and where most organisations actually are is not closing as fast as either the optimists or the pessimists tend to claim. Progress is real. It is also uneven, slower in brownfield environments than in greenfield ones, dependent on organisational change that is hard and slow and political, and nowhere near as universal as the frontier teams make it look. The dark factory does not need more engineers. It desperately needs better ones. And better means people who can think clearly about what should exist, describe it precisely enough that machines can build it, and evaluate whether what got built actually serves the real humans it was built for. That has always been the hard part of software engineering. The machines have stripped away the camouflage of implementation complexity. We are all about to find out how good we really are at the part that was always hardest.

References

[1] Becker, J., Rush, N., Barnes, E., and Rein, D. (2025). Measuring the Impact of Early 2025 AI on Experienced Open Source Developer Productivity. METR. arXiv:2507.09089. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/

[2] Shapiro, D. (2026, January 23). The Five Levels: From Spicy Autocomplete to the Dark Factory. danshapiro.com. https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/

[3] McCarthy, J. (2026, February 6). Software Factory Manifesto. StrongDM. https://factory.strongdm.ai/

[4] Willison, S. (2026, February 7). How StrongDM’s AI team build serious software without even looking at the code. simonwillison.net. https://simonwillison.net/2026/Feb/7/software-factory/

[5] Anthropic (2026). Claude Code product metrics and GitHub commit data. Cited in multiple industry reports, February 2026.

[6] Google DORA Research Program (2024). Accelerate State of DevOps Report 2024. Google Cloud. https://dora.dev/research/2024/

[7] Hosseini, S. M. and Lichtinger, G. (2025). Seniority Biased Change: AI Adoption and Junior Employment. Harvard University. Cited in Fortune, September 2025. https://fortune.com/2025/09/04/ai-entry-level-jobs-uncertainty-college-grads/

[8] Rezi (2026, January). The Crisis of Entry Level Labor in the Age of AI 2024 to 2026. https://www.rezi.ai/posts/entry-level-jobs-and-ai-2026-report

[9] Stanford Digital Economy Lab (2025). Digital Economy Study: Employment Trends in AI Exposed Occupations. Cited in Stack Overflow Blog, December 2025. https://stackoverflow.blog/2025/12/26/ai-vs-gen-z/

[10] Stanford Law School CodeX Center (2026, February 8). Built by Agents, Tested by Agents, Trusted by Whom? https://law.stanford.edu/2026/02/08/built-by-agents-tested-by-agents-trusted-by-whom/

[11] METR (2026, February 24). We are Changing our Developer Productivity Experiment Design. https://metr.org/blog/2026-02-24-uplift-update/

[12] Polyglot Factotum (2026, February). Dark Factory AI Review: Innovation or Slop? Medium. https://medium.com/@polyglot_factotum/slop-review-with-ai-the-dark-factory-ffca22406822

[13] Infralovers (2026, February 22). Dark Factory Architecture: How Level 4 Actually Works. https://www.infralovers.com/blog/2026-02-22-architektur-patterns-dark-factory/

[14] Codeconductor.ai (2026, January). Junior Developers in the Age of AI: Future of Entry Level Software Engineers. https://codeconductor.ai/blog/future-of-junior-developers-ai/