Tokenmaxxing Is The Dumbest Metric In Tech Right Now
Counting tokens is the new lines-of-code, and engineering leadership keeps falling for it.
“Deeply alarmed.” That’s how NVIDIA’s Jensen Huang said he’d feel, at GTC in March, about any $500,000-per-year engineer who wasn’t burning at least $250,000 worth of AI tokens to do their job.
I manage engineers. Huang is wrong about this, and the handful of CTOs echoing him publicly probably know it. Token consumption isn’t a measure of engineering productivity. It’s among the worst input metrics the industry has reached for in a generation, and it’s spreading fast.
A dashboard that shouldn’t have existed
Earlier this month, an engineer at Meta built an internal leaderboard, called Claudeonomics, that ranked all 85,000-plus Meta employees by AI token consumption. The Information broke the story. Over a 30-day stretch, Meta employees had collectively burned more than 60 trillion tokens. The leaderboard gamified the spend with titles like Token Legend, Session Immortal, and Cache Wizard. The single heaviest user went through 281 billion tokens over the month. Mark Zuckerberg didn’t crack the top 250. Neither did CTO Andrew Bosworth. Within a couple of days of The Information’s story, Meta took the dashboard down.
At Anthropic’s public Opus pricing, 60 trillion tokens comes out to roughly $900M for the month. Meta is almost certainly buying at a discount, and Gergely Orosz at The Pragmatic Engineer has estimated the real bill is more likely north of $100M. Even the discount number is a lot of money to pay for a leaderboard that had to be taken down.
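If you want to sanity-check that headline number, the napkin math is short. Here’s a minimal sketch, assuming Anthropic’s published Opus input rate of $15 per million tokens; output and cache-read tokens are priced differently, so treat this as an order-of-magnitude estimate rather than a reconstruction of Meta’s invoice:

```python
# Napkin math: what 60 trillion tokens costs at list price.
# ASSUMPTION: $15 per million tokens, Anthropic's published Opus input rate.
# Output tokens cost more and cache reads cost less, so this is only an
# order-of-magnitude check, not Meta's actual bill.
TOKENS = 60e12                # ~60 trillion tokens in 30 days
PRICE_PER_MILLION = 15.00     # USD, list price for Opus input tokens

monthly_bill = TOKENS / 1e6 * PRICE_PER_MILLION
print(f"${monthly_bill / 1e9:.1f}B/month")  # -> $0.9B/month, roughly $900M
```

Swap in a plausible enterprise discount and the number lands closer to Orosz’s estimate.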
This isn’t one weird dashboard. It’s a pattern.
Bosworth said in February that a top engineer spending the equivalent of their salary on tokens was delivering 10x output, and framed it as a no-brainer with no upper limit. Meta’s Chief People Officer, Janelle Gale, has told staff that “AI-driven impact” will be a core expectation in 2026, the same year the company overhauled performance reviews to push top-performer bonuses as high as 200%. Microsoft has run its own internal token leaderboard since January, where distinguished engineers and VPs sit in the top ranks despite writing very little code in their actual roles. At Salesforce, engineers get a Mac widget that updates their personal token spend every 15 minutes and a tool that lets them look up any colleague’s spend. The minimum target last week was $100 on Claude Code and $70 on Cursor per engineer, per month.
Meanwhile, the data on whether any of this is actually working isn’t kind. Jellyfish looked at 7,548 engineers in Q1 2026 and found that engineers with the largest token budgets produced twice the pull requests at ten times the token cost, which is an efficiency problem even before you ask whether the PRs were any good. Faros AI’s March report found code churn up 861% under high AI adoption. Waydev, tracking more than 10,000 engineers at 50 customers, found that AI-written code looks like it’s accepted at 80-90% initially, but the real-world number drops to 10-30% once you count the rewrites made in the following weeks.
The charitable reads
Two versions of the steelman deserve airtime before I throw punches. A steelman is the strongest version of an argument you disagree with, the one worth engaging.
The first: at Meta’s scale, rolling out a new class of tooling to 85,000 engineers requires a forcing function stronger than “we think you should try this.” A visible leaderboard plus a performance-review signal are blunt instruments, but they do move adoption numbers. If the goal is getting a large engineering org over the activation energy of trying AI coding agents, and the cost of that is a year of gamed numbers, the trade might work out.
The second read is sharper. A long-tenured Meta engineer suggested the real goal of Claudeonomics wasn’t productivity measurement at all. It was generating real-world agent traces, at industrial scale, to train Meta’s next in-house coding model. A leaderboard disguised as a performance tool that’s really a data-generation rig. Expensive, but Meta has the means, and if that’s the actual play, it’s a cleaner rationale than the public one.
Give both readings their full weight. Neither one makes the metric less broken on its own terms.
What the metric actually trains
An engineer at Microsoft was willing to describe exactly what tokenmaxxing does to the person being measured. They’re not tokenmaxxing because they want to climb the leaderboard. They’re doing it because they don’t want to be seen as someone who “uses too little AI.”
Here’s what they admit to doing. If their internal documentation already has the answer to a question they need answered, they’ll route the question through Claude instead of reading the doc, because reading the doc would show up as low AI usage on the dashboard. Sometimes they prompt the agent to prototype features they have no intention of shipping, just to rack up spend. Other times they default to the agent on tasks they know they could finish faster by hand, and watch it fail.
Separately, a Meta engineer told The Pragmatic Engineer that some production incidents at the company looked like they came from careless AI code generation, where the responsible engineer seemed more focused on volume than on whether the code worked.
Read that again. Nobody at Microsoft hired their engineer to ask Claude what the docs say when the docs are right there. Nobody at Meta hired their engineer to ship code that causes outages. Both of them are doing these things because the measurement system is telling them to. Every new engineer joining one of these companies is watching and learning that the job includes burning tokens convincingly. That’s the skill the dashboard selects for, and the skill their replacements will practice.
The cost of a bad metric is never just the bad number on the screen. The real cost is the habits it trains into the engineers measured by it, and those habits outlast the metric by years.
We have done this before
Tokenmaxxing feels like lines-of-code as a productivity metric, all over again. We already ran that experiment across most of the 1980s and 90s. By the end of that stretch, the conclusion was settled: the best engineers don’t write the most code. The best engineers solve the hardest problems fastest, usually with less code than average, and sometimes with no code at all.
Tokenmaxxing is the same category error with a worse error bar. Lines of code at least landed in the repo, where another engineer could read them and call bullshit. Tokens just land on a bill. You can’t code-review a token.
Every engineering leader reaching for this metric should know this history. Some of them lived through it the first time.
What I’m watching on my team instead
At Auth0, I run a team that ships developer content and internal tooling. Here’s what I actually look at when I want to know if the engineers on my team are getting real value from their AI tools.
Are we closing more tickets this month than last month? Is content shipping on time, and when it ships, is it performing on the metrics we track for the business? Adoption on the tools we own is a number I can look at. So is revenue contribution on the projects my team is part of. The biggest question, six months into any given initiative, is whether the users we serve are getting more value than they were before we started.
That’s the list. It isn’t clever. It’s the boring collection of outputs a company is actually paying my team to deliver. When I was designing how I’d evaluate performance on this team, I spent more time than I’d like to admit trying to find something smarter. I couldn’t. The boring list holds up.
Here’s what I don’t measure. I don’t know what the token spend of anyone on my team is. It hasn’t come up in a 1:1, and it won’t come up in a review. If someone is shipping real work, whatever they spent to get there was worth it. If they’re not shipping, cutting their token budget isn’t the lever that fixes it.
The good news is that the smarter companies are already walking this back. Shopify ran one of the first token dashboards in the industry, back in 2025. By the time Gergely followed up with Shopify’s Head of Engineering earlier this month, the company had quietly renamed their “leaderboard” to a “usage dashboard” to stop the gamification, added circuit breakers to catch runaway agents, and started having their engineering leader personally check in with top spenders to understand what they were actually using the tokens for. One of the more interesting directions they’ve moved toward isn’t total spend but per-token cost: engineers whose individual tokens come out more expensive tend to be the ones doing deeper, harder work.
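For the curious, here’s a minimal sketch of what that per-token signal could look like. The usage-log shape and field names are my invention; Shopify hasn’t published how they compute it:

```python
# Hypothetical sketch of a per-token-cost signal per engineer.
# ASSUMPTION: the event shape below is made up for illustration;
# this is not Shopify's implementation.
from collections import defaultdict

def cost_per_token_by_engineer(usage_events):
    """usage_events: iterable of dicts like
    {"engineer": "alice", "tokens": 1_200_000, "cost_usd": 42.50}"""
    tokens = defaultdict(int)
    cost = defaultdict(float)
    for event in usage_events:
        tokens[event["engineer"]] += event["tokens"]
        cost[event["engineer"]] += event["cost_usd"]
    # Higher $/token usually means pricier models, longer contexts, and
    # less cache reuse -- the profile of deeper work, per the argument above.
    return {eng: cost[eng] / tokens[eng] for eng in tokens if tokens[eng]}
```

The point isn’t this exact query. It’s that the signal rewards depth instead of volume.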
That’s a saner direction and it doesn’t require anyone to be brilliant. It requires engineering leadership to accept that the job of measurement is harder than reading a number off a dashboard, and to do the harder job anyway.
On a research team or a long-horizon infrastructure team, this gets harder. Outputs are slower and noisier in those contexts. But slower-to-measure outputs should be a prompt to find better output proxies. It’s not a license to start counting inputs.
I won’t run a leaderboard
Most engineering measurement is still hard. This part is easy: there’s nothing a token leaderboard tells you about an engineer that you couldn’t learn faster by asking them what they shipped this week, and what they’re stuck on.
The metric is dumb because the conversation it replaces is the job.