The AI Productivity Bill Comes Due in Production
AI makes code cheaper. It does not make weak ideas useful or production work free.
The easiest place for an AI rollout to look successful is the velocity dashboard. Pull request count is up. Cycle time improves. More code gets merged. The tool has a tidy story to tell.
Production usually tells the longer version. The review queue gets heavier. The same two senior engineers become the validation layer for a larger volume of plausible patches. Support sees more small changes with surprising edge cases. The team ships more often, but on-call starts to feel more expensive. None of this proves the AI rollout failed. It proves the dashboard stopped too early.
That is the standard I would use for this debate: AI can make code cheaper. It cannot make a weak idea useful, and it cannot make production work free.
The AI productivity conversation is still too comfortable measuring the part of software work that AI makes easiest to see. Lines of code, pull requests, story points, and deployment frequency all sit close to the production of work. They are not the same as value. A team can generate more code, open more pull requests, ship more frequently, and still leave customers with worse software and engineers with a more fragile system to operate.
I do not care very much whether AI helped a team produce more lines of code. Story points were already a weak proxy before implementation became cheaper. Deployment frequency is useful as a delivery capability signal, but it is not a product outcome. Even feature count can lie if the team is shipping work customers do not need. The harsher question is whether the team shipped something that made sense for users and whether the delivery system absorbed the change without pushing hidden cost into review, incidents, support, security, or maintenance.
If the answer is no, the team did not become more productive. It found a cheaper way to create activity.
The old cost was a filter
Dax Raad, who created OpenCode, made a useful version of this argument in a February 2026 post that Business Insider covered. His point was not that AI coding tools are useless. It was that code production was often not the real constraint. In one sharp line, he wrote that “ideas being expensive to implement was actually helping.”
That is uncomfortable because it names something engineering organizations do not like to admit. Implementation cost was a filter. Not a fair filter, not always a good filter, and often a frustrating one. But it forced some ideas to die before they became roadmap commitments, support obligations, security surfaces, and half-owned production behavior.
When implementation gets cheaper, weak ideas can travel farther before anyone feels the cost. The organization can build more experiments, more variants, more internal tools, more half-promised features, and more code that nobody is quite ready to own. Some of that is good. AI should make useful neglected work cheaper: tests, documentation, migrations, cleanup, internal tooling, and prototypes that were not worth a full planning cycle. But lower implementation cost also lowers the friction that used to make teams ask whether the idea deserved implementation at all.
The bill does not disappear. It moves downstream.
A shallow AI rollout makes code generation faster and then asks the same review, testing, security, release, and support systems to absorb the extra work. That is where the cost shows up: not when the patch compiles, and not when the dashboard celebrates a larger pile of merged changes, but when the organization has to review, ship, operate, recover from, and maintain the work it just made easier to produce.
Harness has a timely but imperfect signal here. Its 2026 State of DevOps Modernization report is vendor research, so I would not treat it as neutral proof. Still, the pattern is worth attention. Harness surveyed 700 engineering practitioners and managers in large enterprises in February 2026. Very frequent AI coding users were more likely to report daily or faster deployments: 45%, compared with 15% of occasional AI coding users. But 69% of that same very-frequent group said AI-generated code leads to deployment problems at least half the time. The report also found higher reported rollback, hotfix, or customer-impacting incident rates among very frequent users, and longer mean time to recovery for production incidents related to code deployments.
Harness explicitly says there is no causal proof that AI coding caused those problems. That caveat matters. The teams using AI heavily may already be under more delivery pressure, may already deploy more often, or may already have weaker delivery systems. But the caveat does not rescue the easy productivity story. If AI is adopted inside a system that cannot absorb more change, local coding speed becomes a stress multiplier. A good tool can still be deployed into an unready system.
Stack Overflow’s May 27, 2026 pulse survey points in the same direction from a different angle. Among 1,100 respondents, workplace agent use had risen to 59%, but 63% said they rarely or never let agents run entirely on autopilot. Accuracy and security remained the top concerns. This is not an argument that agents have failed. It is evidence that serious teams still treat production AI work as supervised work.
Google’s 2025 DORA AI-assisted software development report uses the right frame: successful AI adoption is a systems problem, not a tools problem. The report also says value stream management should help local productivity gains turn into product performance instead of downstream chaos. That is the sentence I would put on the wall before any AI productivity review. DORA metrics can tell you whether the delivery system is getting healthier or sicker. They cannot tell you whether the thing you shipped should exist. They are guardrails for delivery health, not proof of product value.
The new work is supervision
The bottleneck does not only move. The work changes shape.
A May 2026 longitudinal study of professional software engineers found that AI coding assistants shifted work away from creation and toward verification. Participants reported spending less time on most development tasks, including 82% who reported spending less time writing code. The authors call the new category “supervisory engineering work”: directing, evaluating, and correcting AI output. They also found a productivity-experience paradox. Self-reported productivity improvements stayed high, but among matched participants, the share reporting worse developer experience in at least one dimension nearly doubled from 14% to 27%.
That matches the texture many senior engineers recognize. AI does not remove judgment. It moves judgment to a different place and then sends more material to that place.
Code review is where this becomes obvious. A 2026 vision paper on code review in the age of AI argues that AI coding assistants increase code production velocity while expanding the volume of code requiring review, turning review into a growing bottleneck unless the workflow around it changes. The paper is a research agenda rather than an outcome study, so I would not cite it as proof that every team is seeing this. But the direction is right. If the organization makes code cheaper and keeps review mostly manual, review becomes the place where the productivity story has to pay rent.
That is especially true for senior engineers. A team can celebrate more output while concentrating more validation work on the people least able to absorb it. Those engineers already carry architectural memory, production intuition, incident scar tissue, and the informal taste that keeps a codebase from turning into a pile of plausible patches. If AI increases the amount of code that needs their judgment, the organization may be spending its rarest capacity faster than before.
Siddhant Khare, who builds agent infrastructure, described the human version of this shift in his February 2026 essay on AI fatigue: “AI reduces the cost of production but increases the cost of coordination, review, and decision-making.” That is the part most productivity dashboards miss. The code may arrive faster, but the decisions around the code do not become free. Someone still has to decide whether the output is correct, safe, maintainable, aligned with the architecture, worth shipping, and worth owning.
This is why I distrust most AI productivity metrics. They usually stop counting at the point where AI looks best.
Measure past the merge
The useful question is what the system has to absorb after the code exists.
Lines of code, pull request counts, story points, and even cycle time can create the appearance of progress while missing the harder question: did customers get better software, and did the delivery system stay healthy? If cycle time improves because reviews get thinner, tests get weaker, or teams ship work users do not care about, the metric did its job poorly. It measured motion and missed value.
Harness’s 2026 State of Engineering Excellence report makes this measurement problem explicit, again with the same vendor-evidence caveat. Its survey says 81% of engineering leaders report increased code review time since deploying AI, 31% of a developer’s day is now consumed by AI-related invisible work, and 94% say technical debt, validation time, and developer burnout are missing from current metrics. I would not build a strategy from those exact numbers. I would take the pattern seriously: organizations are measuring output more readily than they are measuring the effort required to make that output safe and useful.
METR’s February 2026 update on developer productivity measurement is useful for the same reason. METR’s earlier randomized study found that experienced open-source developers were slower with early-2025 AI tools, but its 2026 update says the next experiment design became hard to interpret as AI adoption changed participation, task selection, quality choices, and time reporting. Developers selected different tasks when AI was allowed, changed how much documentation or testing they produced, and sometimes worked on other things while agents ran. That does not make measurement hopeless. It means the simple before-and-after story is often the least trustworthy story.
The measurement boundary has to move downstream, toward the places where software work becomes value or turns into debt. I would want a downstream ledger with five categories.
Customer value: adoption, task completion, retention, revenue, support contacts, or whatever signal maps to the user’s life getting better.
Delivery health: lead time, change failure rate, rollback rate, mean time to recovery, flaky pipeline pain, and release interruptions.
Review quality: pull request size, review queue time, review rounds, senior reviewer concentration, and whether reviewers are being asked to validate code the author does not understand.
Maintenance cost: rework, follow-up fixes, documentation drift, dead features, duplicate code, and code that becomes hard to change a month later.
Human cost: cognitive load, on-call interruption, after-hours recovery, and whether the best engineers are spending more time supervising output than making hard technical decisions.
The exact metrics depend on the team. The principle does not: do not let the measurement stop at the point where AI looks best.
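To make part of that ledger concrete, here is a minimal sketch of three of its numbers: change failure rate, mean time to recovery, and senior reviewer concentration. The record shape, the field names, the sample data, and the choice to look at the two busiest reviewers are all my assumptions for illustration. None of it comes from the reports cited above, and a real team would feed these functions from its own deployment and review tooling.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Deployment:
    # Hypothetical record shape; a real team would pull this from its own tooling.
    failed: bool                   # needed a rollback, hotfix, or incident response
    recovery_minutes: float = 0.0  # time to restore service, used only when failed


def change_failure_rate(deploys):
    """Fraction of deployments that failed in production."""
    if not deploys:
        return 0.0
    return sum(d.failed for d in deploys) / len(deploys)


def mean_time_to_recovery(deploys):
    """Average recovery time in minutes across failed deployments only."""
    recoveries = [d.recovery_minutes for d in deploys if d.failed]
    return mean(recoveries) if recoveries else 0.0


def reviewer_concentration(reviewers, top_n=2):
    """Share of all reviews done by the top_n busiest reviewers."""
    if not reviewers:
        return 0.0
    counts = Counter(reviewers)
    busiest = sum(count for _, count in counts.most_common(top_n))
    return busiest / len(reviewers)


# Invented sample data: four deployments and eight completed reviews.
sample_deploys = [
    Deployment(False),
    Deployment(True, 30.0),
    Deployment(True, 90.0),
    Deployment(False),
]
sample_reviewers = ["ana", "ana", "ben", "ben", "ben", "cy", "dee", "eli"]

print("change failure rate:", change_failure_rate(sample_deploys))
print("mean time to recovery (min):", mean_time_to_recovery(sample_deploys))
print("top-2 reviewer share:", reviewer_concentration(sample_reviewers))
```

The absolute values matter less than the trend. If pull request volume climbs after the rollout and the top-two reviewer share climbs with it, the output gain is being paid for out of the scarcest review capacity.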
Roll it out as a production change
If I were accountable for an engineering organization adopting AI coding tools, I would not frame the rollout as “make engineers faster.” That framing points everyone at the part of the system the tool can most easily accelerate, then acts surprised when the rest of the system starts to strain. I would frame it as a delivery and product quality change.
For production work, the author still owns the change. I do not care whether the code came from a model, a snippet, a Stack Overflow answer, a generated patch, or a late-night burst of human confidence. The person merging it is responsible for understanding it. “The AI wrote it” is not a root cause. It is a sign that ownership got blurry.
I would also make the risk boundaries explicit. AI-generated tests, documentation, migrations, internal tools, and boilerplate can often move faster with lower risk. Code touching authentication, authorization, payments, data migration, concurrency, incident recovery, privacy, or customer-visible behavior should not get a lighter review because a model produced it. In some cases it deserves a heavier review, because plausible code can be harder to distrust than obviously messy code.
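One way to make those boundaries explicit is to derive the review tier from what a change touches, not from who or what wrote it. The following is a minimal sketch under assumptions I invented for illustration: the directory prefixes, the tier names, and the rule that a single high-risk file escalates the whole change are all hypothetical, and a real policy would cover more of the categories listed above.

```python
# Hypothetical repository layout; adjust the prefixes to the real codebase.
HIGH_RISK_PREFIXES = ("src/auth/", "src/payments/", "src/privacy/")
LOW_RISK_PREFIXES = ("tests/", "docs/")


def review_tier(changed_paths):
    """Pick a review tier from the files a change touches.

    The tier deliberately ignores whether a model produced the code.
    """
    # Any high-risk file escalates the whole change to full review (or heavier).
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in changed_paths):
        return "full"
    # Only changes confined entirely to low-risk areas qualify for a lighter pass.
    if changed_paths and all(path.startswith(LOW_RISK_PREFIXES) for path in changed_paths):
        return "light"
    return "standard"


print(review_tier(["docs/setup.md", "tests/test_login.py"]))
print(review_tier(["docs/setup.md", "src/auth/login.py"]))
print(review_tier(["src/ui/button.py"]))
```

The design choice worth keeping, whatever the real rules look like, is that provenance never lowers the tier. A generated patch to the payments path gets the same scrutiny as a handwritten one.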
I would protect review capacity before celebrating output. If AI increases pull request volume, the answer is not to tell senior engineers to review faster. Shrink the changes. Require authors to explain generated code before review. Track review queue time. Give reviewers room to do the work. Make it acceptable to reject a generated patch because the author cannot explain the tradeoffs. A review culture that depends on two overloaded senior engineers was already fragile. AI just makes the fragility harder to hide.
I would measure the rollout at the team and system level, not as individual surveillance. The goal is not to rank engineers by prompt efficiency, AI acceptance rate, or commits touched by a model. That will teach people to game the tool and hide the work. The goal is to learn whether the team is delivering more customer value with equal or better production health, review quality, maintenance cost, and human sustainability.
That is a higher bar than “we shipped more code.” It is also the only bar that matters.
The useful counterargument
Some teams really are shipping more value with AI. I will not pretend every AI gain is fake. AI can make neglected work cheaper: tests, documentation, migrations, internal tools, API exploration, cleanup, and the boring tasks that used to lose every prioritization fight. For teams with strong product judgment, review discipline, and ownership norms, those gains can turn into better software and less operational pain.
But that does not weaken the argument. It raises the standard of proof.
If AI helps your team ship a feature customers use, with the same or better reliability, without overloading review, without hiding maintenance cost, and without turning senior engineers into cleanup infrastructure, then call it productivity. That is the kind of win I would take seriously. If AI mostly increases the number of things in motion, the number of pull requests waiting for review, the number of features nobody asked for, or the amount of code nobody fully understands, then the team has not discovered leverage. It has discovered a cheaper way to create inventory.
Software teams already had too much inventory: backlogs full of maybe-work, roadmaps full of stakeholder theater, codebases full of half-owned decisions, and dashboards full of metrics that make motion look like value. AI does not forgive those habits. It compounds them.
Users pay the invoice
The title says the bill comes due in production, and that is true. Production is where weak assumptions stop being private. It is where review gaps, unclear ownership, fragile rollouts, missing tests, and confused product decisions become visible.
But production is not the final customer. Users are.
A production system can survive a change that still should not have been built. A team can deploy safely while wasting months on features that do not improve the product. AI productivity measured only at the engineering boundary will miss that failure.
The sharp version of the claim is this: AI productivity that does not reach users as better software is not productivity. It is cheaper throughput inside the factory.
That does not make AI bad, and it does not mean teams should avoid the tools. Teams should use them and push them hard. Let engineers automate the boring parts. Let agents draft, test, explain, migrate, and explore. But do not let the organization pretend that faster code is the same as better software.
The bill comes due in production, but the invoice is paid by customers. If they do not get a better product, and the team does not get a healthier delivery system, the productivity story is just a nicer dashboard wrapped around more work.


