Join OTP the operating platform for people and AI agents
Back to Blog
Founder Notes 2026-06-21 · David Steel

Measuring an AI agent's performance requires a named human owner and a business-outcome metric, not an employee review

Most conversations about AI agent performance end up stuck between two bad analogies.

The first treats an agent like software. You measure uptime, tokens consumed, latency, tasks completed per hour. You watch dashboards that look like server monitoring. The numbers are clean. They say almost nothing about whether the business is getting better.

The second treats an agent like an employee. You talk about onboarding it, reviewing its performance, setting goals with it. It sounds reasonable, especially if the agent has a name and a defined role. It feels like management. The research says it leads to reduced accountability, more unnecessary escalation, and lower quality review by the humans who are supposed to be supervising it.

Neither frame is right. The research is clear on why, and so is the operating experience I have from running twelve agents at Sneeze It over the past eighteen months.

Why the software frame fails

If you measure an agent the way you measure software infrastructure, you will get accurate numbers for the wrong things.

Dash, our analytics agent, processes dozens of data requests per day. Arin, our call center manager agent, fires Slack coaching messages to our human callers every morning. Nick, our cold prospecting agent, drafts outbound emails at volume. By runtime metrics, all three look productive constantly.

But runtime metrics do not tell you whether Dash's analysis is changing the decisions my team makes. They do not tell you whether Arin's coaching is moving the appointment rate toward our 30% target. They do not tell you whether Nick's drafts are landing with qualified prospects or getting ignored.

The software frame makes agents look like infrastructure instead of contributors. Infrastructure does not hold seats on the org chart. Infrastructure does not have a row that goes red when a business outcome drops. Infrastructure monitoring is the right frame for a database. It is the wrong frame for an agent that carries work the business depends on.

Why the employee frame is the wrong correction

The obvious move when the software frame fails is to import the people-management playbook. Give the agent a role, write it a job description, run it through a performance review cycle. At Sneeze It, our agents do have names and job descriptions. Radar has a defined role as chief of staff. Dirk is accountable for pipeline. Tally pushes KPIs to the scorecard on a cadence. They each hold seats with clear ownership. It can look a lot like managing people.

Here is where the research cuts in, and I think it is worth taking seriously.

HBR and BCG published a study in May 2026 that ran a large controlled experiment on exactly this question. They found that treating AI agents like employees specifically, thinking of them in HR-category terms with titles, reviews, and performance conversations directed at the agent, produced measurable harm. Individual accountability declined. People escalated more questions that they should have resolved themselves. Review quality fell.

Their model is a "rented contractor with a narrow statement of work," governed by scoped permissions, kill switches, audit logs, and named human owners. The claim is not that agents do not deserve measurement. It is that the accountability must sit with a human, not transfer to the agent.

MIT SMR puts it plainly: agentic AI cannot be accountable for its decisions. The deploying human is.

These two findings point at the same thing. The software frame obscures the work. The employee frame displaces accountability. Neither is the structure you need.

What the right frame actually looks like

The frame that works is accountability architecture.

Every agent seat gets three things. First, a named human owner. At Sneeze It, every agent has a specific person who is accountable for what that agent produces. Bogdan, our COO, sits above several of our operational agents in the accountability chain. Janine, our accounting lead, is the named human owner for financial data flows. The agent does the work. A human owns the outcome. That ownership never moves.

Second, a business-outcome metric. Not a runtime metric. Not "messages sent" or "tasks processed per hour." The metric is the thing the seat is actually supposed to move. Tally's metric is KPI data accuracy and cadence. Arin's metric is the call center appointment rate against a 30% target. Dash's metric is whether the right accounts get flagged at the right time. Crystal's metric is project delivery health across our client base. If you cannot write a business-outcome metric for the seat, the seat does not have a clear enough role yet.

Third, a human-retained review process. Agents publish their numbers. Humans review them. The conversation about whether a number is up or down happens with the human seat-owner, not with the agent. The agent is not in the room defending its performance. The owner is. This is not a technicality. It is the structural choice that prevents the accountability displacement the research describes.

What "onboarding" and "retiring" actually mean under this frame

The language of onboarding and performance reviews creates confusion because people load those words with their employee-management meaning. Under an accountability architecture frame, both words get redefined.

Onboarding an agent means two things: establishing its scoped permissions (what can it touch, what is it restricted from, where are the kill switches) and defining its metric clearly enough that underperformance is visible. That is it. You are not introducing the agent to the culture. You are not setting development goals. You are defining the scope of work and the signal that tells you whether it is working.

Retiring an agent means a human decided the seat is no longer earned. It is not a performance improvement process. It is a structural decision made by accountable humans about whether the role produces value. In April, we retired Jeff, who had been our data integrity agent. The retirement happened through what we called a hearing. Humans reviewed the evidence. Jeff's capabilities were redistributed to other seats, notably Dash. The decision to retire was a human decision. The accountability for the outcome was a human accountability. Jeff did not make a case for his own survival and win or lose. The humans decided.

That is what retirement looks like when accountability architecture is the frame. The agent does not own the decision. The seat-owner does.

The causal argument for doing this correctly

Here is the reason this matters beyond getting the philosophy right.

When agents do not have named human owners, one of two failure modes appears. Either nobody is watching the number the seat is supposed to move, and the agent drifts without correction, or multiple people assume someone else is watching, and the accountability diffuses into noise.

When agents are measured by runtime metrics instead of business outcomes, the incentives drift. You optimize for what you measure. A seat optimized for messages sent per hour will send more messages. It will not necessarily send the right ones.

When the employee frame displaces accountability onto the agent, human reviewers lower their guard. The HBR/BCG finding is not abstract. If the agent has a title and a review cycle, the humans in the loop start to treat the agent's outputs as something that has already been reviewed, because something that looks like a review process is running. Review quality falls because the frame implies review has already happened.

The accountability architecture frame prevents all three. The named human owner means there is always someone watching. The business-outcome metric means you are measuring the thing that matters. The human-retained review means nobody treats the agent's output as pre-approved.

How this runs at Sneeze It week to week

Our weekly review runs one scorecard with humans and agents on the same rows. Bogdan and Janine have rows. So does Radar. So does Pulse, our client retention agent. So does Pepper, our executive assistant agent.

Each row shows the seat, the metric, the target, the current number, and a trend arrow. The conversation walks every row the same way: what is the number, is it above or below target, what explains the gap, what is the fix.

The difference is in who speaks to the gap. When Bogdan's number drops, Bogdan speaks to it. When Radar's number drops, Bogdan speaks to it too, because he is the named human owner of Radar's seat in the accountability chain. The agent produced the number. The human owns the explanation and the fix.

That structure is why agents at Sneeze It stay on track without requiring constant tuning. The accountability never moved to the agent. It stayed with the human who took the seat. That seat-owner has skin in the number because their row is on the same dashboard as everyone else's.

What this means for HR's mandate

The CHRO's job in the agent era is not to extend the employee lifecycle to agents. It is to build the accountability architecture that keeps human ownership clear as the agent workforce grows.

Korn Ferry found that only 5% of CHROs feel fully prepared for this moment, even though 42% have made AI investment a priority. The preparation gap is exactly here: most CHROs are trying to figure out whether agents are employees or tools, when the right frame is neither. Agents are seated contributors with named human owners and business-outcome metrics. The governance layer stays human. The work shifts to agents. That is the architecture.

Josh Bersin described the shift as redesigning how we get things done, with that work landing in HR's lap. The redesign is not writing job descriptions for agents. It is building the structure where agents carry the operational work, so people are free for the work that matters: judgment, care, culture, and the decision about what stays human.

The agent does the work. The human owns the outcome. That sentence is the policy. Everything else is implementation.

See the live chart

Every seat at Sneeze It, agent and human, is queryable via the OTP MCP: you can ask which seats are agent-owned versus human-owned, who the named human owner of each agent seat is, and what metric each seat is currently tracking.

In Claude Desktop or Cursor or any MCP client, add this block:

"otp": {
  "command": "npx",
  "args": ["-y", "@orgtp/mcp-server"]
}

Restart the client. Then ask: "Use OTP to show me the Sneeze It org chart. For each agent seat, show the named human owner and the primary metric."

That query returns exactly the accountability architecture described in this post -- which means you can see what the implementation looks like before you build yours.

DS
David Steel

Founder of OTP. Runs an AI agent army at a digital agency. Building OTP because nobody else seems to be building it. Notes from inside the build, not from the conference circuit.

More about David →

More posts on the blog index.

All posts