How Do You Calculate the ROI of AI in Engineering?

The metrics most teams reach for (token usage, lines of code, PR counts) are poor measures of ROI. Here is what to measure instead, and how to build a number your CFO will actually believe.

There are two numbers engineering teams love to quote when justifying their AI spend: token usage and lines of code. Both are pointless as ROI metrics. Token usage tells you how much the AI is working. Lines of code tells you how much it is producing. Neither tells you whether any of it is creating business value. And in the worst cases the numbers are actively misleading: more code shipped faster into a fragile deployment pipeline looks like progress while it adds risk.

DORA’s 2026 report puts it plainly: “code is often seen as a liability, not an asset.” Operational costs of running applications and services far outweigh the costs of building them. Generating more code without proper oversight increases verification overhead and leads to long-term technical debt.

This post is about what to actually measure, how to build an ROI calculation that holds up to scrutiny, which frameworks give you different parts of the picture, and why most organisations are measuring the wrong things and drawing the wrong conclusions.

The Misconception at the Centre of Most AI Investments

When engineering leaders present the case for AI tooling, they typically lead with productivity: developers are shipping more code, faster. That is true. DORA’s 2025 research found that increased AI adoption was associated with increased software delivery throughput. More changes, in less time.

What gets buried in the footnotes: the same research found that increased AI adoption was also associated with an increase in software delivery instability. More code, more failures. The verification tax (the cost of reviewing AI-generated output) grows in proportion to the speed of generation.

A developer using Cursor ships twice as many PRs. Those PRs are 154% larger on average. Median time in code review is up 441% on high-AI-adoption teams. The pipeline downstream of the individual developer has not scaled to match the output upstream of them.

This is what DORA calls the J-Curve of AI value realisation. When you adopt AI tooling, productivity temporarily dips as teams learn, pipelines adapt and verification overhead accumulates. Most organisations that fail with AI investments do not fail because the tools are bad. They fail because leadership sees the J-Curve dip, mistakes it for failure and pulls funding before the curve turns.

The organisations that succeed build for the J-Curve explicitly. They budget for the learning cost, invest in the underlying engineering platform and measure their way through the dip rather than reacting to it.

What ROI Actually Means Here

ROI is straightforward in principle: ROI (%) = (Value - Investment) / Investment × 100. The hard part is defining value in a way that is credible to a CFO, not just an engineering team.
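As a minimal sketch of that formula, the Python function below returns ROI as a percentage. The function name and the example figures are illustrative assumptions, not numbers from any report.

```python
def roi_percent(value: float, investment: float) -> float:
    """Return ROI as a percentage: (value - investment) / investment * 100."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return (value - investment) / investment * 100


# Hypothetical figures, for illustration only.
annual_value = 300_000.0       # assumed value attributed to AI tooling
annual_investment = 200_000.0  # assumed licences, token spend and enablement time
print(f"ROI: {roi_percent(annual_value, annual_investment):.1f}%")
```

The arithmetic is the easy part. Everything that follows is about making the value input defensible.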

DORA’s framework identifies five value pillars for AI in software development, ordered from most directly measurable to least:

Cost efficiency. Direct reductions in development cost, infrastructure spend and operational overhead. The most quantifiable early signal of AI’s value.

Productivity. Developer time freed from routine tasks and redirected to higher-value work. This rarely shows up as headcount reduction; it shows up as more features shipped with the same team.

Developer experience. Reduced burnout, lower cognitive load, better retention. Real financial consequences (replacing a senior engineer costs 1.5–2× their annual salary), but harder to attribute directly to AI tooling.

User experience. Faster development cycles and higher product quality translate into better software, stronger retention and improved NPS. Attribution is challenging; user satisfaction responds to many variables simultaneously.

Business growth. AI investment feeds developer output, which shapes product quality, which influences user retention, which drives revenue. Every step in that chain adds measurement complexity, yet this is the pillar that gets executives to sign off on AI investments in the first place.

The Frameworks: What Each One Sees and What It Misses

No single framework gives you the complete picture. You need to layer them intentionally.

DORA: The Delivery Pipeline View

DORA’s four metrics (deployment frequency, lead time for change, change failure rate and time to restore service) remain the strongest foundation for measuring whether AI is actually accelerating safe software delivery. They are objective, pipeline-derived and directly tied to business outcomes.

For AI ROI specifically, DORA is the right lens for the J-Curve. Lead time for change tells you whether speed gains at the developer level are reaching production or being absorbed in manual testing and approval gates. Change failure rate tells you whether the instability that comes with early AI adoption is improving or accumulating.
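To watch the J-Curve month by month, you could compute these figures from deployment records. The sketch below assumes a hypothetical Deployment record with a commit time, a production time and a failure flag; the field names are illustrative and not taken from any DORA tooling, and time to restore service is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production
    caused_failure: bool    # True if it led to an incident or rollback


def monthly_dora(deployments: list[Deployment]) -> dict[str, dict[str, float]]:
    """Group deployments by month and compute three DORA-style figures."""
    by_month: dict[str, list[Deployment]] = defaultdict(list)
    for d in deployments:
        by_month[d.deployed_at.strftime("%Y-%m")].append(d)

    report: dict[str, dict[str, float]] = {}
    for month, items in sorted(by_month.items()):
        lead_hours = [
            (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in items
        ]
        report[month] = {
            "deployments": len(items),
            "median_lead_time_hours": median(lead_hours),
            "change_failure_rate": sum(d.caused_failure for d in items) / len(items),
        }
    return report


# Made-up sample data: two deployments in January, one in February.
sample = [
    Deployment(datetime(2025, 1, 6, 9), datetime(2025, 1, 7, 9), False),
    Deployment(datetime(2025, 1, 13, 9), datetime(2025, 1, 15, 9), True),
    Deployment(datetime(2025, 2, 3, 9), datetime(2025, 2, 3, 21), False),
]
for month, stats in monthly_dora(sample).items():
    print(month, stats)
```

Run the same calculation on the months before adoption and you have the baseline to compare against.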

What DORA misses: everything happening at the human level. A team can hit elite DORA scores while its best engineers are burning out, doing QA on AI output all day and quietly looking for the exit.

SPACE: The Developer Experience View

SPACE measures five dimensions: Satisfaction and well-being; Performance; Activity; Communication and collaboration; and Efficiency and flow. For AI adoption, the most relevant dimensions are Satisfaction and Efficiency. Satisfaction tells you whether developers feel the tools are helping or creating work, a leading indicator of retention risk that shows up six to twelve months before it appears in attrition data.

What SPACE misses: it is survey-based, which means it reflects perceptions, not pipeline reality. You need both.

DX Core 4: The Unified View

DX Core 4 consolidates the best of DORA, SPACE and DevEx into four dimensions: Speed, Effectiveness, Quality and Business Impact. For AI ROI, the Quality dimension is where it earns its place. AI now writes 41% of code in high-adoption teams. Bug density is higher in projects with unreviewed AI-generated code. Nearly half of all AI code needs manual debugging in production even after passing QA and staging tests.

DevEx: The Flow and Friction View

DevEx focuses on three dimensions: feedback loops, cognitive load and flow state. For AI adoption specifically, cognitive load is the critical dimension. The verification tax (reading, evaluating and correcting AI output) is a significant and unmeasured cognitive burden on senior engineers.

GSM: The Goal-Driven View

GSM (Goals, Signals, Metrics) is most useful at the start of an adoption initiative and at quarterly reviews. It prevents the common failure mode of collecting every available metric, building a comprehensive dashboard and then being unable to answer the one question the CFO is actually asking.
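One way to picture GSM is as a filter over the dashboard: write down the goal, the signals that would tell you it is being met and the metrics that quantify those signals, then see which collected metrics are left over. The sketch below is an assumption about how you might model that; the Goal class, the example goal and the metric names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    name: str
    signals: list[str]  # what you would observe if the goal were being met
    metrics: list[str] = field(default_factory=list)  # how the signals are quantified


def split_metrics(goals: list[Goal], collected: set[str]) -> tuple[set[str], set[str]]:
    """Split collected metrics into those tied to a goal and those tied to none."""
    wanted = {m for g in goals for m in g.metrics}
    return collected & wanted, collected - wanted


# Hypothetical goal and dashboard contents.
goal = Goal(
    name="Ship features faster without more incidents",
    signals=["Changes reach production sooner", "Failures do not increase"],
    metrics=["lead_time_for_change", "change_failure_rate"],
)
collected = {"lead_time_for_change", "change_failure_rate", "lines_of_code", "token_usage"}

keep, unlinked = split_metrics([goal], collected)
print("report:", sorted(keep))
print("not tied to any goal:", sorted(unlinked))
```

Metrics in the second group are not necessarily useless, but they should not lead the conversation with the CFO.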

A Layered Measurement Model

The frameworks above are not alternatives; they are layers.

Layer 1: DORA for pipeline health. Your objective baseline and J-Curve monitor. Measure it before adoption. Track it monthly after.

Layer 2: SPACE and DevEx for human health. Run quarterly surveys. Add specific questions about AI tool friction, cognitive load from verification and sense of ownership over output. This is your early warning system for the people cost of AI adoption.

Layer 3: AI-specific metrics. Token cost per successful deployment. Review time per AI-generated PR versus human-authored PR. Change failure rate attributable to AI-generated code. These require custom instrumentation, but they are the metrics that actually prove AI ROI at the code level.

Layer 4: Financial translation. Convert engineering metrics into business language. Recovered headcount from reduced rework. Revenue impact from faster feature delivery. Avoided downtime cost from improved recovery time.
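The Layer 3 metrics need custom instrumentation, but the calculations themselves are simple once the data exists. The sketch below assumes you can label each PR as AI-assisted or not (for example via a PR template checkbox) and that you know token spend per period; the PullRequest class, its fields and the sample figures are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class PullRequest:
    ai_assisted: bool    # assumed label, e.g. from a PR template checkbox
    review_hours: float  # time from review requested to approval


def token_cost_per_successful_deployment(
    token_spend: float, deployments: int, failed_deployments: int
) -> float:
    """Token spend divided by the deployments that did not fail in production."""
    successful = deployments - failed_deployments
    if successful <= 0:
        raise ValueError("no successful deployments in this period")
    return token_spend / successful


def median_review_hours(prs: list[PullRequest]) -> dict[str, float]:
    """Median review time, split by whether the PR was AI-assisted."""
    groups: dict[str, list[float]] = {"ai": [], "human": []}
    for pr in prs:
        groups["ai" if pr.ai_assisted else "human"].append(pr.review_hours)
    # Skip a group entirely if it has no PRs, since median() needs data.
    return {label: median(hours) for label, hours in groups.items() if hours}


# Made-up figures for one month.
print(token_cost_per_successful_deployment(4_000.0, deployments=50, failed_deployments=10))
prs = [
    PullRequest(True, 6.0),
    PullRequest(True, 10.0),
    PullRequest(False, 3.0),
    PullRequest(False, 5.0),
]
print(median_review_hours(prs))
```

The hard part in practice is the labelling, not the maths: the segmentation is only as trustworthy as the way PRs get marked as AI-assisted.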

The Metrics That Actually Matter

Lead time for change. From code committed to code in production. The best proxy for whether AI is actually accelerating delivery end-to-end.

Deployment frequency. If this is not increasing with AI adoption, the bottleneck is not the developer; it is the pipeline around them.

Change failure rate. Expect it to rise in the first three months. If it has not recovered by month six, your verification infrastructure has not kept pace with your generation speed.

Token cost per successful deployment. Connects AI infrastructure spend to engineering output. Should be declining as context engineering matures.

Review time per PR, segmented by AI versus human-authored. The verification tax made visible and attributable.

Developer experience score, quarterly. Leading indicator for everything that DORA will surface later.

Percentage of engineering time on unplanned work. The rework metric. This is where the “free headcount” calculation lives.
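The change failure rate guidance above (expect a rise in the first three months, worry if it has not recovered by month six) can be turned into a simple check against your pre-adoption baseline. This is a sketch under that reading of the rule; the function name, status labels and sample rates are illustrative assumptions.

```python
def cfr_status(baseline: float, monthly_cfr: list[float]) -> str:
    """Classify post-adoption change failure rate against the pre-adoption baseline.

    monthly_cfr[0] is the first month after adoption.
    """
    if not monthly_cfr:
        return "no data"
    latest = monthly_cfr[-1]
    if latest <= baseline:
        return "at or below baseline"
    if len(monthly_cfr) < 6:
        return "in the dip"
    return "not recovered by month six"


# Made-up rates with a 10% pre-adoption baseline.
print(cfr_status(0.10, [0.14, 0.16, 0.15]))
print(cfr_status(0.10, [0.14, 0.16, 0.15, 0.12, 0.10, 0.09]))
print(cfr_status(0.10, [0.14, 0.16, 0.15, 0.14, 0.13, 0.13]))
```

The last status is the one that should trigger a conversation about verification infrastructure rather than about the tools themselves.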

The Number Your CFO Wants

“Developers are 30% more productive” is not a number a CFO can stress-test. Here is the number that holds up:

Engineering capacity recovered from reduced rework, expressed as equivalent headcount.

If a team of 50 engineers spends 20% of their time on rework and AI reduces that to 10%, you have recovered 10% of 50 engineers’ time, the equivalent of five full-time engineers’ worth of capacity. At a £176,000 average salary, that is £880,000 of recovered capacity per year. Compare that directly against your AI tooling cost, context engineering investment and framework implementation overhead, and you have a defensible ROI model.
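The recovered-capacity arithmetic is easy to make reproducible so a finance team can change the inputs themselves. The sketch below uses the figures from the example above; the function name and parameter names are illustrative.

```python
def recovered_capacity(
    engineers: int, rework_before: float, rework_after: float, avg_salary: float
) -> tuple[float, float]:
    """Return (full-time-equivalent engineers recovered, annual value of that capacity)."""
    fte = engineers * (rework_before - rework_after)
    return fte, fte * avg_salary


# Figures from the example above: 50 engineers, rework falling from 20% to 10%.
fte, value = recovered_capacity(50, 0.20, 0.10, 176_000)
print(f"{fte:.1f} FTE recovered, worth £{value:,.0f} per year")
```

Any total annual investment below that recovered value gives a positive return on this line item alone, before faster delivery or avoided downtime are counted.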

Pair that with a deployment frequency increase connected to faster time-to-revenue and a change failure rate improvement connected to avoided downtime cost, and you have a number that will hold up in a board presentation.

To Summarise

AI tooling in engineering is a capability amplifier, not a cost reduction play. The teams capturing real ROI are the ones that invested in the underlying engineering platform first (the deployment pipelines, the testing infrastructure, the review processes) and then let AI accelerate what was already working.

No single framework gives you the complete picture. DORA tells you whether the pipeline is getting safer and faster. SPACE tells you whether the people running it are sustainable. DX Core 4 tells you whether quality is holding. DevEx tells you where the friction is. GSM tells you which of all those metrics to actually focus on for your specific goal.

The measurement approach that works is not the most comprehensive one. It is the most intentional one.