Founder Notes 2026-06-21 · David Steel

The ROI calculation that actually tells you whether your agents are paying off

Most agent ROI calculations prove what the person running them wanted to prove.

I have seen enough of them now to say that with confidence. The agents are running. The reports look good. Someone adds up time saved, multiplies by a blended hourly rate, and lands on a number with a dollar sign in front of it. The number is shared at a board meeting or a leadership offsite. Everyone feels good about the investment.

Then a year in, the operation has the same constraints it always had. The revenue has not grown the way the number promised it would. The team is not working on meaningfully higher-value problems. And nobody can explain the gap because the ROI calculation never connected to anything real.

The COO's job is to kill that calculation and replace it with one that is harder to produce and harder to argue with.

Why the standard calculation fails

The standard agent ROI calculation has five failure modes. They are not exotic. They are predictable, and if you have run any agents for more than six months, you have probably fallen into at least two of them without knowing it.

The first failure mode is measuring effort saved instead of outcome moved.

The calculation looks like this: a task used to take a person two hours, and the agent does it in four minutes. Multiply the two hours by the fully loaded hourly cost of whoever used to do the task, and that is the ROI. The problem is that the two hours saved only produce value if they go to work that matters more. Freed time that flows back into meetings, overhead, and low-value coordination produces nothing. The ROI calculation claims value the business never captured.

At Sneeze It, Pepper handles email triage. Pepper catches client-urgent emails, drafts responses, and surfaces what needs my attention before I open my inbox. The triage time saved is real. But the ROI from Pepper is not the triage time multiplied by my hourly rate. The ROI is whether my attention, freed from inbox processing, now goes to the work that only I can do. That question is harder to answer. It requires actually tracking where founder attention goes. The triage time calculation is easier, which is why people use it, and it is also wrong.
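
The gap between the claimed number and the captured number is easy to see in a few lines of Python. This is a minimal sketch, not how Sneeze It computes anything: the function names, the figures, and the redeployed_share parameter (the fraction of freed hours that tracking shows went to higher-value work) are assumptions for illustration. Valuing redeployed hours at the same hourly cost is itself a simplification.

```python
def naive_roi(hours_saved: float, hourly_cost: float) -> float:
    """Time saved times a loaded hourly rate: the figure that usually gets reported."""
    return hours_saved * hourly_cost


def captured_value(hours_saved: float, hourly_cost: float, redeployed_share: float) -> float:
    """Count only the freed hours that were tracked going to higher-value work."""
    if not 0.0 <= redeployed_share <= 1.0:
        raise ValueError("redeployed_share must be between 0 and 1")
    return hours_saved * redeployed_share * hourly_cost


# Hypothetical figures, chosen only to show the gap.
hours_saved = 40.0        # hours per month the agent frees up
hourly_cost = 100.0       # fully loaded hourly cost of the person
redeployed_share = 0.25   # share of freed hours tracked into higher-value work

claimed = naive_roi(hours_saved, hourly_cost)
captured = captured_value(hours_saved, hourly_cost, redeployed_share)
print(f"claimed: {claimed:.0f}  captured: {captured:.0f}")
```

The second function cannot be run without the attention tracking described above, which is the point: if redeployed_share is unknown, the claimed figure is unverified.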

The second failure mode is counting outputs as value.

Emails sent. Reports generated. Accounts reviewed. Calls logged. These are outputs. They are not value. Value is what changes in the business because those outputs existed.

Nick, our cold prospecting agent, drafts quality outreach emails to Health and Wellness businesses. Nick's single KPI is quality drafts per day. But if those drafts go out and nothing enters the pipeline, Nick's outputs produced no value. The ROI of Nick is not the number of drafts. It is the number of qualified conversations that exist because Nick ran, measured against what it cost to produce them, compared to what it would have cost to produce the same conversations without Nick. That is a harder number. It is also the only number worth calculating.
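
That comparison can be sketched as cost per qualified conversation, with and without the agent. The Channel class, the function names, and every figure below are hypothetical, and the "without the agent" row is only as good as the evidence behind it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Channel:
    name: str
    total_cost: float              # everything spent to run the channel for the period
    qualified_conversations: int   # conversations that actually entered the pipeline

    def cost_per_conversation(self) -> Optional[float]:
        # No conversations means no value, however many drafts went out.
        if self.qualified_conversations == 0:
            return None
        return self.total_cost / self.qualified_conversations


def saving_per_conversation(agent: Channel, baseline: Channel) -> Optional[float]:
    """Positive when the agent produces a qualified conversation more cheaply."""
    agent_cost = agent.cost_per_conversation()
    baseline_cost = baseline.cost_per_conversation()
    if agent_cost is None or baseline_cost is None:
        return None
    return baseline_cost - agent_cost


# Hypothetical figures for one month.
with_agent = Channel("agent-drafted outreach", total_cost=600.0, qualified_conversations=12)
without_agent = Channel("manual outreach", total_cost=1800.0, qualified_conversations=12)

print(saving_per_conversation(with_agent, without_agent))
```

Draft volume does not appear anywhere in the calculation. An agent that sends a thousand drafts and opens zero conversations returns None, not a large number.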

The third failure mode is benchmarking against a counterfactual nobody verified.

This is the "headcount we didn't hire" problem. The calculation assumes that without the agent, the company would have hired someone to do the work. The agent's cost is compared to that imaginary hire's cost. The difference is called savings.

The problem is that the counterfactual is invented. You do not know whether you would have hired. You do not know what that person would have cost fully loaded. You do not know how long it would have taken them to get productive. You are comparing your real spend to a number you made up. The ROI looks compelling because you chose a favorable counterfactual. Change the counterfactual and the ROI changes with it.
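
A quick sensitivity check makes the dependence visible. This is a sketch under stated assumptions: the function name, the scenarios, and all figures are invented for illustration, and ramp-up time is left out to keep it short.

```python
def headcount_savings(agent_annual_cost: float,
                      hire_probability: float,
                      hire_loaded_cost: float) -> float:
    """Expected 'savings' versus a hire that may never have happened."""
    return hire_probability * hire_loaded_cost - agent_annual_cost


agent_annual_cost = 30_000.0  # hypothetical real spend on the agent

# Each scenario is (probability we would have hired, fully loaded annual cost).
scenarios = {
    "favorable": (1.0, 90_000.0),
    "moderate": (0.5, 70_000.0),
    "skeptical": (0.25, 80_000.0),
}

results = {
    label: headcount_savings(agent_annual_cost, probability, cost)
    for label, (probability, cost) in scenarios.items()
}

for label, savings in results.items():
    print(f"{label}: {savings:,.0f}")
```

The agent's real cost is the same in every row. Only the invented inputs change, and they are enough to flip the sign of the result.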

Bogdan, our COO, and I do not calculate agent ROI by asking what human we didn't hire. We ask what the operation can now do that it couldn't do before, and what that capability is worth to the business. Those are harder questions. They require judgment, not arithmetic. But they are the right questions.

The fourth failure mode is ignoring what the agent broke or made worse.

When an agent takes over a process step, it changes the inputs and outputs for every adjacent seat. If those changes are net negative, the agent's ROI is overstated by exactly the amount of cost it created downstream.

When Crystal, our project management agent, took on delivery tracking across active Accelo projects, the gain was real: delivery gaps surfaced faster, without a human manually chasing project status. But the first version of Crystal's outputs required cleanup before they were usable by the humans reading them. That cleanup was downstream cost. Any ROI calculation that counted Crystal's output volume without counting the downstream cleanup cost was flattering Crystal at Bogdan's expense.

Agents do not operate in isolation. The ROI is the net across the whole chain, not the gain at the seat.
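
In arithmetic terms, the net is the gain at the seat minus every cost the seat pushed downstream. A minimal sketch, with hypothetical cost categories and hours that are not Crystal's actual figures:

```python
def net_chain_value(seat_gain: float, downstream_costs: dict) -> float:
    """Gain at the agent's seat minus the costs it created for adjacent seats."""
    return seat_gain - sum(downstream_costs.values())


# Hypothetical weekly hours for a delivery-tracking agent.
seat_gain = 10.0  # hours of manual status chasing no longer needed
downstream_costs = {
    "cleanup before outputs are usable": 4.0,
    "re-checking flagged items": 1.5,
}

net = net_chain_value(seat_gain, downstream_costs)
print(f"seat gain: {seat_gain}  net across chain: {net}")
```

If the downstream entries are left empty because nobody measured them, the function returns the seat gain unchanged, which is exactly the overstated number.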

The fifth failure mode is locking in the ROI claim before the process was right.

Accenture puts it cleanly: do not make inefficiency run efficiently. An agent dropped into a broken process makes the broken process run faster. The ROI calculation captures the speed gain. It misses the fact that the output is still wrong, just delivered more quickly.

Jeff, our former data integrity agent, ran for months. His activity metrics were reasonable. The tasks he completed were real tasks inside a real process. But the process itself was poorly defined, and the outputs from Jeff's seat were not well-connected to decisions that changed anything. We retired Jeff in April, after an honest hearing, because the seat had been carrying work that was either being done better elsewhere or wasn't work the business actually needed. The ROI on Jeff was negative in hindsight. The calculation at the time looked fine because we were counting tasks, not outcomes.

What the right frame looks like

The right frame for agent ROI is not a single number. It is three questions that the COO has to be able to answer with evidence, not with confidence.

The first question is: what specific business outcome was this agent hired to move?

Not "improve efficiency." Not "reduce manual work." A specific, named outcome. Dirk's outcome is pipeline stage transitions and qualified meetings booked. Arin's outcome is appointment rate against a 30% conversion target across the call center. Tally's outcome is scorecard freshness: whether each KPI seat on the chart has a current value pushed in the last business day. Every agent seat has a specific outcome or it does not have a seat.

The second question is: did that outcome move by more than the variation we'd expect without the agent, and over a window long enough to mean something?

Deloitte's 2026 State of AI research, covering 3,235 enterprises, found that only 21% have a mature governance model for agentic AI. Part of what separates mature from immature is exactly this: mature organizations capture a baseline before the agent runs, track the named outcome on a defined cadence, and wait for a measurement window long enough to separate signal from noise. The 79% who are not there are deploying and then calculating retrospectively, which is how you get a flattering number with no grounding.
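
One crude way to operationalize the second question is to compare the post-deployment average against the baseline's own week-to-week noise, and to refuse to answer when the window is too short. The function name, the eight-period minimum, the threshold of 2, and the data are all assumptions for illustration. This is a rough screen, not a substitute for a proper statistical test.

```python
from math import sqrt
from statistics import mean, stdev


def outcome_moved(baseline, post, min_periods=8, threshold=2.0):
    """Compare the post-agent mean with the baseline mean, scaled by baseline noise."""
    if min_periods < 2:
        raise ValueError("min_periods must be at least 2")
    if len(baseline) < min_periods or len(post) < min_periods:
        return "window too short"
    lift = mean(post) - mean(baseline)
    # Standard error of a mean over len(post) periods, using baseline variation.
    noise = stdev(baseline) / sqrt(len(post))
    if noise == 0:
        return "moved" if lift > 0 else "within normal variation"
    return "moved" if lift / noise >= threshold else "within normal variation"


# Hypothetical weekly values of one named outcome (for example, meetings booked).
baseline_weeks = [10, 12, 11, 9, 10, 12, 11, 9]   # captured before the agent ran
strong_weeks = [13, 14, 12, 15, 13, 14, 12, 15]   # clear lift
flat_weeks = [11, 10, 11, 10, 11, 10, 11, 12]     # looks better, but within noise

print(outcome_moved(baseline_weeks, strong_weeks))
print(outcome_moved(baseline_weeks, flat_weeks))
print(outcome_moved(baseline_weeks, strong_weeks[:3]))
```

Without a baseline captured before deployment, the first argument does not exist and the check cannot be run at all.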

The third question is: what did it cost to move that outcome, and is the cost trending in the right direction?

MIT CISR's enterprise AI maturity research found that Stage 4 firms, where humans and agents operate together under shared accountability, outperform their industries by 13.9 percentage points in growth and 9.9 percentage points in profit. The gap is not how many agents they deployed. It is whether the agents are embedded in an accountability structure that forces the cost-to-outcome question on a regular cadence.

At Sneeze It, every agent seat is on the same chart as every human seat. Bogdan is on the chart. Janine is on the chart. Radar is on the chart. Dirk is on the chart. One scorecard, one accountability structure, one-seat-one-owner. When Bogdan and I run the Monday numbers, the agent rows are not separate from the human rows. They are adjacent to them. The cost-to-outcome question comes up for Dirk the same way it comes up for anyone else whose number dropped last week. What changed. What was the cause. What is the fix.

That discipline is the only thing that prevents the ROI calculation from being whatever story the team wanted to tell.

The diagnostic, not the spreadsheet

A COO who wants a real picture of whether the agents are paying off does not start with a spreadsheet. They start with three diagnostics.

First: map every agent seat to a named business outcome. If any seat cannot be mapped in one sentence to a metric that matters to the business, that seat does not have clear accountability. Run the fix before running the ROI.
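
The first diagnostic amounts to a lookup with a filter. In the sketch below, the outcomes for Dirk, Arin, and Tally are paraphrased from this post; the two example seats, the VAGUE_OUTCOMES list, and the function name are hypothetical.

```python
# Phrases that name no outcome at all.
VAGUE_OUTCOMES = {"improve efficiency", "reduce manual work"}

seat_outcomes = {
    "Dirk": "pipeline stage transitions and qualified meetings booked",
    "Arin": "appointment rate against a 30% conversion target",
    "Tally": "scorecard freshness within the last business day",
    "Example seat A": "",                    # hypothetical: nothing mapped
    "Example seat B": "Improve efficiency",  # hypothetical: mapped to a slogan
}


def unmapped_seats(outcomes: dict) -> list:
    """Return seats whose outcome is missing or too vague to hold anyone accountable."""
    flagged = []
    for seat, outcome in outcomes.items():
        text = outcome.strip().lower()
        if not text or text in VAGUE_OUTCOMES:
            flagged.append(seat)
    return flagged


print(unmapped_seats(seat_outcomes))
```

A string match cannot judge whether a named outcome actually matters to the business. That part stays a human call.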

Second: audit what the agent's outputs actually feed. Who reads Dash's daily performance brief? What decisions change because of it? Which of those decisions would have been made anyway, and which required what Dash produced? The ROI of an analysis agent is only as real as the decisions it changes.

Third: run the chain comparison. Take the outcome you are trying to measure. Walk backward through every seat in the process chain that contributes to it, agent and human. Ask where the highest cost is relative to the contribution to the outcome. That is where the next dollar of optimization belongs, not on the seat that has the most impressive activity metrics.
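
The chain comparison can be sketched as a ranking by cost relative to contribution. The Seat class, the seat names, the costs, and especially the contribution shares are hypothetical; in practice the shares are judgment estimates, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Seat:
    name: str
    kind: str                  # "agent" or "human"
    monthly_cost: float
    contribution_share: float  # estimated share of the outcome this seat drives


def cost_per_contribution(seat: Seat) -> float:
    """Cost per unit of estimated contribution; infinite if the seat contributes nothing."""
    if seat.contribution_share <= 0:
        return float("inf")
    return seat.monthly_cost / seat.contribution_share


def rank_chain(seats: list) -> list:
    """Most expensive seat relative to its contribution comes first."""
    return sorted(seats, key=cost_per_contribution, reverse=True)


# Hypothetical chain behind one outcome.
chain = [
    Seat("Seat A", "agent", monthly_cost=200.0, contribution_share=0.5),
    Seat("Seat B", "human", monthly_cost=900.0, contribution_share=0.25),
    Seat("Seat C", "agent", monthly_cost=300.0, contribution_share=0.25),
]

for seat in rank_chain(chain):
    print(seat.name, seat.kind, cost_per_contribution(seat))
```

Activity metrics appear nowhere in the ranking. The seat at the top of the list is where the next optimization dollar goes, whether it is an agent or a human.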

The agents that are genuinely paying off at Sneeze It are the ones where I can answer all three diagnostics without hesitation. Radar compiles the daily briefing. The briefing changes how Bogdan and I start the day. The decisions we make in the first hour are better because of it. The cost of running Radar is a small fraction of what a chief-of-staff role would cost. The outcome is attributable and the chain is clear.

The agent seats that do not survive the diagnostics get the same treatment as any seat that doesn't earn its place. The hearing is honest, the capabilities are redistributed, and the seat is retired. Jeff got that treatment. Every seat will get it if the numbers stop making sense.

Let agents carry the operational work. Free people for the work that matters. But do not trust the ROI calculation that tells you what you wanted to hear. Run the diagnostics. Ask the hard version of the three questions. The answer, when it is real, is more than enough.

See the live chart

The outcome metric for every agent seat at Sneeze It is queryable from the OTP MCP.

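In Claude Desktop, Cursor, or any other MCP client, add this block to the client's MCP server configuration (in Claude Desktop and Cursor, that is the "mcpServers" object in the config file):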

"otp": {
  "command": "npx",
  "args": ["-y", "@orgtp/mcp-server"]
}

Restart the client. Then ask: "Use OTP to show me the KPI scorecard for sneeze-it and identify what business outcome each seat is accountable for."

The answer shows you what a seat-level outcome structure looks like when it is built to make the ROI question answerable rather than just impressive.


Series: The AI-Era COO. Part 48 of an in-progress series.

David Steel

Founder of OTP. Runs an AI agent army at a digital agency. Building OTP because nobody else seems to be building it. Notes from inside the build, not from the conference circuit.
