The most disorienting thing that happened when I started running agents alongside humans was not the agents failing. It was the metrics breaking.
The metrics that told me how my humans were performing stopped making sense. Not because my people were performing worse. Because the work they were doing had changed. And the measurements I was using were still calibrated to the old version of the job.
That is the problem I want to work through here.
When an agent absorbs half the operational work on a seat, the human in that seat does not do half as much. The human does a different kind of work. And the metrics that were designed for the original job do not capture the new job. Which means you are flying with a broken altimeter, confident you have a reading while the actual number drifts.
What the old metrics measured
Before I had agents, I measured my team the way most small teams measure themselves. Volume, speed, completion rate.
For our call center, we tracked dials, show rate, and appointment rate. For the sales side, I tracked proposals sent, follow-ups completed, pipeline movement. For our client success function, I tracked calls held, communication cadence, retention rate. All of it was activity-plus-outcome, with the activity component designed to proxy effort.
The activity component made sense when humans were doing all the work. If the dial count was low, a person was not working hard enough. If the proposal backlog was growing, someone was slow. The volume signal carried real information about how the human was performing.
Then agents came in. Radar took over the daily briefings, the calendar orchestration, the Slack scanning. Dirk took the sales pipeline. Dash took the ad performance analysis. Tally pushed the KPIs. Arin managed the call center coaching loop. Nick ran cold prospecting at thirty qualified drafts per day.
The humans did not vanish. But the definition of a strong day changed entirely.
The before and the after
Before agents, Bogdan's strong day as COO looked like: three client fires handled, a dozen Slack messages dispatched, two project status checks completed, one proposal reviewed. High output. High volume. Measurable in that week's completion count.
After agents, Bogdan's strong day looks like: one structural decision that clarified how Crystal should prioritize project escalations, one conversation with Janine about a billing edge case that no agent could have resolved, one hour with me on whether our current capacity model is right for the next quarter. Lower volume. Higher value. Much harder to measure on a completion count.
The old metrics would say Bogdan had a slow week. The honest read is that he had a high-leverage week. The measurement broke.
The same shift happened for Janine. Before: transaction volume, invoice turnaround, coding accuracy. After: judgment calls on exception handling, relationship decisions on overdue accounts, capacity flags on billing edge cases. The completion count is lower. The impact is higher. The old metric misreads it.
Deloitte's 2025 Global Human Capital Trends research found that managers spend roughly 40 percent of their time on administrative tasks compared to 13 percent on people development. The agents are eating the administrative 40 percent. What remains is the development and judgment work, and that work was never counted before because it was drowned out by the volume of admin.
The measurement failure that matters most
Here is what happens if you do not rebuild the metrics.
The humans on your team see that the old metrics no longer apply. They know their job has changed. They know that the completion counts have shifted. They are waiting for you to update how you are measuring them. If you do not, one of two things happens.
First possibility: they drift toward the work that is still being measured, because that is what gets counted. They pad the activity that the old metric tracks, because the old metric is still on the scorecard. The agent is doing the real throughput work, and the human is performing for a measurement that no longer connects to outcomes.
Second possibility: they lose context on whether they are doing well. Korn Ferry's 2025 Workforce research found that 70 percent of senior leaders say their organization has an AI strategy, while only 39 percent of employees agree. That gap is not primarily political. It is a measurement gap. The strategy says agents are absorbing the operational work and freeing humans for higher-value contribution. The metrics say output is volume and volume is lower. The employee experiences the contradiction as confusion about whether they are doing a good job.
Both outcomes are HR failures, not technology failures.
What to measure instead
The rebuild starts by identifying what the human is actually accountable for that the agent cannot do.
Agents are strong at: high-frequency pattern recognition, consistent process execution, data compilation, first-draft production, monitoring and alerting. Agents are not strong at: judgment calls on exceptions, reading a relationship and deciding how to handle it, deciding which work is worth doing at all, maintaining accountability when something goes wrong.
So the new measurement categories look like this.
First: judgment quality. How many judgment calls did this person handle, and how often did they get the call right? For Bogdan, this is the structural and capacity decisions he makes. For Janine, it is the exception-handling calls on our billing. The metric is not volume of decisions, but accuracy and speed on the decisions that landed on their desk.
Second: escalation intelligence. When the agents surface something they cannot resolve, who routes it correctly and quickly? This is a real skill, and it is now one of the most important skills on a hybrid team. The human who understands what Dash is flagging and knows exactly who needs to act on it is doing high-value coordination work that the old metrics did not measure.
Third: agent quality oversight. Every agent on our chart has a human owner who is accountable for that seat's output. When Arin sends a coaching message to the call center team, I review it. When Nick produces a batch of thirty cold email drafts, how well they fit our ideal customer profile (ICP) comes down to the brief I gave him. The agent's output quality is a measure of the human's ownership quality.
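To make the three categories concrete, here is a minimal sketch of how they could be turned into numbers. Everything in it is an assumption for illustration: the record shapes (JudgmentCall, Escalation, AgentReview), the field names, and the choice of median hours and approval rate are hypothetical, not the scorecard we actually run.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class JudgmentCall:
    """One decision that landed on a human's desk (hypothetical record shape)."""
    hours_to_decide: float
    held_up: bool  # True if the call was judged correct on later review


@dataclass
class Escalation:
    """One item an agent surfaced because it could not resolve it."""
    routed_to_right_owner: bool
    hours_to_route: float


@dataclass
class AgentReview:
    """One human review of a batch of agent output."""
    items_reviewed: int
    items_approved: int


def judgment_quality(calls):
    """Return (accuracy, median hours to decide), or None if there were no calls."""
    if not calls:
        return None
    accuracy = sum(c.held_up for c in calls) / len(calls)
    return accuracy, median(c.hours_to_decide for c in calls)


def escalation_intelligence(escalations):
    """Return (share routed correctly, median hours to route), or None if empty."""
    if not escalations:
        return None
    correct = sum(e.routed_to_right_owner for e in escalations) / len(escalations)
    return correct, median(e.hours_to_route for e in escalations)


def agent_oversight(reviews):
    """Return the share of reviewed agent output that passed, or None if nothing was reviewed."""
    reviewed = sum(r.items_reviewed for r in reviews)
    if reviewed == 0:
        return None
    return sum(r.items_approved for r in reviews) / reviewed


# Made-up sample week for one seat owner.
calls = [
    JudgmentCall(2.0, True),
    JudgmentCall(6.0, True),
    JudgmentCall(1.0, False),
    JudgmentCall(3.0, True),
]
escalations = [Escalation(True, 0.5), Escalation(True, 1.5), Escalation(False, 4.0)]
reviews = [AgentReview(30, 27), AgentReview(30, 24)]

print(judgment_quality(calls))
print(escalation_intelligence(escalations))
print(agent_oversight(reviews))
```

Note that none of these functions rewards volume. A person with four judgment calls in a week scores on whether those four were right and timely, not on whether there were forty. The third number, the approval rate on agent output, is the one that ties a human's score to the agent seat they own.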
This last category is where I want to spend a moment, because the literature on this is split in a way that matters to anyone running a hybrid team.
The tension in the research
One body of thinking says to manage agents like coworkers: give them scorecards, hold them to metrics, treat the seat like any other seat. MIT Sloan Management Review found that 69 percent of experts agree agentic AI demands new management approaches, and HBR has described the emerging role of the "agent manager" who runs agents via dashboards and observability.
A different body of thinking, including a May 2026 HBR/BCG study, warns against anthropomorphizing agents. In that research, treating agents like employees in a large-scale experiment reduced individual accountability, increased unnecessary escalation, and lowered review quality. The model they recommend is more like a rented contractor with a narrow statement of work, governed by scoped permissions, kill switches, audit logs, and named human owners.
I think both camps are pointing at the same thing from different angles.
When I put Dirk on the same scorecard as Bogdan, I am not doing it because Dirk is a person. I am doing it because Dirk's seat has a metric, and that metric tells me whether the seat is working. The named human owner accountable for Dirk's seat is me. If Dirk's pipeline scan is off, the fault is in my brief, my scope definition, or my review. MIT SMR put this precisely: "agentic AI cannot be accountable for its decisions. The deploying human is."
The agent gets a measured seat. The human owns the accountability for that seat. That is not anthropomorphizing. That is accountability architecture. Those are different things.
In April, I retired Jeff, one of our data integrity agents. The retirement happened through a hearing where Jeff's performance record was reviewed and his capabilities were redistributed to other seats. The accountability throughout that process was mine. The agent was the seat. I was the owner. When the seat was not earning its place, I was the one who made the decision to close it.
SHRM's 2026 research found that AI is 5.7 times more likely to shift job responsibilities than to eliminate jobs outright, and three times more likely to create new roles than displace them. The measurement implication is clear: you are not measuring whether your humans still have jobs. You are measuring whether their new jobs are being done well. That requires new metrics, not just the old ones with a lower baseline.
The one thing that does not change
When agents do half the work, one thing stays constant: accountability for outcomes lives with humans.
Agents can run Dash's analysis cadence, push Tally's KPI reports, draft Arin's coaching messages, file Dirk's pipeline scans. None of that moves the accountability question. HBR Analytic Services surveyed 603 leaders in late 2025 and found that only 6 percent fully trust agents with core processes. That is not a technology gap. It is an accountability gap, and the right answer is not to trust the agents more. It is to be clear about what the named human owner is accountable for.
That clarity is what the new metrics need to encode.
Bogdan is accountable for structural decisions and capacity judgment. Janine is accountable for billing integrity and exception resolution. Every human on a hybrid team needs a measurement system that captures the judgment and accountability work that is now their primary contribution, not the operational throughput the agents have absorbed.
The before version of the job was measurable in volume. The after version is measurable in quality of judgment, quality of agent oversight, and accuracy of escalation routing. Those are harder to measure. They require more thought. They require looking at outcomes, not just activities.
That harder measurement is what lets agents carry the operational work so people are free for the work that matters. But it only works if you actually build the measurement. Without it, the freedom is invisible, the value is uncounted, and the people doing the most important work on your team look like the people having the slowest week.
See the live chart
You can query the OTP org chart to see which seats are held by humans and which are held by agents, along with each seat's current KPI targets and owner assignment.
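In Claude Desktop, Cursor, or any other MCP client, add this block to the client's MCP server configuration (in Claude Desktop and Cursor, it goes inside the "mcpServers" object):
"otp": {
  "command": "npx",
  "args": ["-y", "@orgtp/mcp-server"]
}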
Restart the client. Then ask: "Use OTP to show me all seats on the sneeze-it chart, grouped by whether the seat owner is a human or an agent."
The response surfaces exactly where the accountability sits and whose name is attached to each measured seat.
Series: AI CHRO. Post 30 of an in-progress series. Previous: HR does not disappear when half your workforce is agents. It changes shape entirely.