The weekly scorecard is the CEO's primary tool for staying honest about what the company is producing.
Not the all-hands. Not the quarterly review. Not the board deck. The scorecard. Every row a number, every number a seat, every seat accountable. Walk the rows on Monday and you know, without interpretation, whether the company is operating or drifting.
Most CEOs I know have run this meeting for years. A number is red, someone explains it, someone owns the fix. The discipline is not complicated. It just has to be consistent.
When I started adding AI agents to the org and putting them on the same scorecard as my human team, the discipline did not change. But the mechanics did. Seven things are different when agents are on the scorecard, and getting any of them wrong costs you the accountability that makes the meeting worth running.
Here is what I changed, and why each change was necessary.
1. The rows stopped being "human" or "agent." They became seats.
The most important structural shift was the simplest one. I stopped organizing the scorecard around who was doing the work and started organizing it around what seat the work belonged to.
At Sneeze It we have Bogdan, our COO, in a seat. We have Janine in an accounting seat. We have Radar in the chief-of-staff seat, Dash in the analytics seat, Tally in the scorecard-maintenance seat, Dirk in the sales seat, Pulse in the client-retention seat, Arin in the call-center-management seat, Nick in the cold-prospecting seat. Some seats are held by humans. Some by agents. The scorecard does not label the difference.
This matters more than it sounds. When the scorecard distinguishes "human rows" from "agent rows," the Monday meeting starts treating them as different kinds of accountability. The human rows get scrutiny. The agent rows get technical explanation. That asymmetry is where drift begins. One scorecard, one standard, no distinction by who holds the seat.
2. Every agent row had to have a business metric, not a runtime metric.
The first version of Dash's scorecard row tracked "accounts scanned per day." That is a runtime metric. It tells you the agent is running. It tells you nothing about whether the business is improving.
The correct metrics for Dash are "spend anomalies caught per week" and "guarantee-client coverage rate." The correct metric for Dirk is "qualified meetings sourced per week." The correct metric for Tally is "KPI push success rate." These are outcomes the seat is accountable for, not evidence that the agent is active.
Runtime metrics make the agent look like a machine that is either on or off. Business metrics make the agent look like a seat that is either performing or not. The Monday meeting can only drive accountability on the second kind of number.
Every time I have added an agent to the scorecard, the first draft of its row used a runtime metric. Every time, I had to translate it into a business outcome before the row was worth running.
3. Dropping rows now live next to their upstream cause.
A scorecard organized by department hides the most important thing the CEO needs to see: which drops are caused by what.
On the old scorecard, Dirk's "qualified meetings sourced" and Nick's "cold email reply rate" lived in different sections because they are different agents. But Nick is upstream of Dirk. When Nick's reply rate drops, Dirk's sourced meetings drop two weeks later. If those rows are not adjacent, the Monday conversation about Dirk's number misses its cause.
I reorganized the scorecard around workflow sequence. Nick feeds Dirk. Arin feeds the appointment volume that Dash monitors. Pepper feeds context that Radar compiles. The rows are ordered by dependency, not by function. This is harder to set up initially and easier to run every week after that. When a downstream row drops, the next row up the chain is sitting right there for inspection.
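A minimal sketch of that ordering, assuming the dependencies are kept as a simple map from each seat to the seats that feed it. The function name and the map are hypothetical; the three edges come from the examples above. It walks each chain depth-first so an upstream row lands directly above the row it feeds.

```python
# Hypothetical dependency map: downstream seat -> seats that feed it.
# The three edges are the ones named in the text; nothing else is assumed.
UPSTREAM_OF = {
    "Dirk": ["Nick"],
    "Dash": ["Arin"],
    "Radar": ["Pepper"],
}


def order_rows(upstream_of):
    """Return seats ordered so every seat appears after the seats that feed it."""
    ordered = []
    seen = set()

    def visit(seat, path=()):
        if seat in path:
            raise ValueError(f"dependency cycle at {seat!r}")
        if seat in seen:
            return
        seen.add(seat)
        # Emit everything upstream first, then the seat itself.
        for parent in upstream_of.get(seat, []):
            visit(parent, path + (seat,))
        ordered.append(seat)

    for seat in upstream_of:
        visit(seat)
    return ordered


print(order_rows(UPSTREAM_OF))
# ['Nick', 'Dirk', 'Arin', 'Dash', 'Pepper', 'Radar']
```

For simple chains like these, each upstream row sits immediately above its downstream row. A seat fed by several chains would still come after all of them, but not every feeder can be adjacent, so that layout would need a judgment call.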
4. The cadence shifted from "who will fix it" to "which seat owns the fix."
When a human's row drops, the Monday conversation has always ended with a name. "Bogdan, you own this." "Janine, you own the follow-up." A human name closes the loop.
When an agent's row drops, the loop wants to close on a technical action: "we need to update the prompt" or "the model needs more context." That is not a bad thing to do, but it is not accountability. It is maintenance.
The right close for an agent's row is still a seat. Not "we need to retrain the agent" but "Dirk's seat-owner owns the fix, and here is what the fix is, and here is when it is due." I serve as seat-owner for most agents at Sneeze It. That means I am accountable for Dirk's number the same way Bogdan is accountable for his. The Monday conversation ends with me owning the fix, not with a standing action item to investigate the model.
This is the hardest shift for most CEOs. It feels wrong to be "accountable" for an agent the way you are accountable for a direct report. But that feeling is exactly what makes the unified scorecard work. Somebody has to own the agent's row. The scorecard forces you to name that person.
5. Retirement became a scorecard event, not a technical cleanup.
We retired Jeff, our former data-integrity agent, in April. The retirement happened because his three rows had been absorbed by other seats, his numbers had been static for weeks, and his seat no longer had a clear job. The decision was made on the scorecard, not in an IT review.
This is how agent retirement should work in a hybrid company. When a row has no meaningful metric, or when the metric is consistently at zero because another seat is doing the work, the row gets retired. Retiring a human seat is called a restructuring. Retiring an agent seat should carry the same weight and the same process, with the same public record kept.
Jeff's retirement involved a hearing, a record, and a redistribution of capabilities to Dash and other seats. The scorecard had a Jeff row, and then it did not, and the reason was documented. That discipline is what makes agent-seat retirement legible to the rest of the team.
Most companies do not retire agents this way. They just stop using them. The agent keeps its row on the dashboard, the metric goes stale, and nobody can tell whether the agent is working, broken, or just forgotten. Treating retirement as a scorecard event prevents that.
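One way to keep a forgotten row from lingering is to flag it automatically for a retirement conversation. The sketch below is an assumption, not our actual tooling: the function name, the four-week window, and the sample numbers are all hypothetical. It flags any row whose recent values have not moved, which covers both "static for weeks" and "stuck at zero."

```python
def retirement_candidates(history, weeks=4):
    """Return rows whose last `weeks` values are all identical.

    history maps a row name to its weekly values, oldest first.
    The four-week default is an illustrative assumption.
    """
    flagged = []
    for row, values in history.items():
        recent = values[-weeks:]
        # Require a full window so a brand-new row is never flagged.
        if len(recent) == weeks and len(set(recent)) == 1:
            flagged.append(row)
    return flagged


# Made-up numbers, for illustration only.
sample = {
    "Jeff": [0, 0, 0, 0, 0],
    "Dash": [3, 5, 4, 6, 5],
}
print(retirement_candidates(sample))  # ['Jeff']
```

The flag only opens the question. Whether the row is retired is still decided on the scorecard, with the reason recorded.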
6. The Monday meeting had to get a structured slot for "seat health," not just "number health."
A human's scorecard row has a number. It also has a person you can ask about context. If Bogdan's number is flat but flat is fine this week for a structural reason, Bogdan can say so in thirty seconds.
An agent's scorecard row has a number. The context behind that number has to be published in advance, because the agent cannot speak up in the meeting. Radar publishes its briefing file before Monday. Dash publishes its alert file. Dirk publishes a state file with its sourcing activity. Every agent writes its own context document on the schedule the seat requires.
This means the Monday meeting reads the agent's file the way it would read a weekly update from a traveling direct report who cannot attend in person. The file is the voice. If the file is stale, the row gets flagged for no data, not walked as if the data were current. We added a standing "stale data" flag in the briefing protocol for exactly this. Any agent file older than eighteen hours is flagged before the meeting starts.
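The stale-data check is simple enough to sketch. This version assumes each seat's last publish time is already known as a timezone-aware timestamp; the function name and the sample times are hypothetical, and only the eighteen-hour threshold comes from our protocol. A seat with no file at all is treated as stale.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=18)  # threshold from the briefing protocol


def stale_seats(last_published, now):
    """Return seats whose context file is missing or older than STALE_AFTER."""
    flagged = []
    for seat, published_at in last_published.items():
        if published_at is None or now - published_at > STALE_AFTER:
            flagged.append(seat)
    return flagged


# Illustrative timestamps, not real publish times.
meeting_start = datetime(2025, 1, 6, 9, 0, tzinfo=timezone.utc)
files = {
    "Radar": meeting_start - timedelta(hours=2),
    "Dash": meeting_start - timedelta(hours=20),
    "Dirk": None,  # no file published
}
print(stale_seats(files, meeting_start))  # ['Dash', 'Dirk']
```

Flagged seats get marked "no data" before the meeting starts, so nobody walks a number that is not current.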
7. The scorecard became the source of hiring decisions for both humans and agents.
The unified scorecard changed where our next hire conversations start. Before agents, a gap in outcomes usually led to a conversation about whether we needed another person. Now, the same gap leads to a conversation about which kind of seat should close it.
When Arin's call-center performance rows showed a coaching gap, we did not immediately hire another call-center manager. We asked whether the coaching feedback loop could be automated and whether a seat existed for it. When Nick's prospecting rows showed us the volume we needed to hit, that informed whether we built Nick's seat out further before adding a human setter.
This is what the scorecard is actually for in a hybrid company. It is not just a performance tool. It is the hiring input. Every red row that persists for three consecutive weeks is a signal that the seat needs to change. Whether the fix is a new agent, a new human, a restructured process, or a retirement is the CEO's judgment call. But the scorecard is what surfaces the question.
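The three-week rule can be checked mechanically. A minimal sketch, assuming each row's weekly status is stored oldest first with True meaning red; the function name and the sample history are hypothetical.

```python
def persistent_reds(history, weeks=3):
    """Return rows that have been red for the last `weeks` consecutive weeks.

    history maps a row name to weekly statuses, oldest first, True = red.
    """
    return [
        row
        for row, reds in history.items()
        if len(reds) >= weeks and all(reds[-weeks:])
    ]


# Made-up statuses, for illustration only.
sample_history = {
    "Dirk": [False, True, True, True],
    "Nick": [True, False, True, True],
}
print(persistent_reds(sample_history))  # ['Dirk']
```

The output is a list of questions, not decisions. What the seat needs is still the CEO's call.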
Deloitte's 2026 State of AI survey found that only 21% of enterprises have a mature governance model for agentic AI. The companies without that governance are running agents, but they are not running them the way you run a workforce. Their agents have no row on any scorecard. No seat-owner. No retirement protocol. It is deployment without accountability.
The scorecard is the governance model. Not the only element, but the weekly forcing function that makes everything else legible.
McKinsey's framing is useful here: managing in the AI era means managing systems of people and agents together. The scorecard is what makes that system visible every Monday.
The mission at Sneeze It is to let agents carry the operational work so people are free for the work that matters. That mission only holds if the agents are actually accountable for the operational work. Accountability needs a number, a row, a seat-owner, and a Monday meeting where someone looks at it.
Seven things change when agents are on that scorecard. Most companies have made none of these changes yet. Start with the rows.
See the live scorecard
Every seat on our chart, including the agent seats with their current metrics and seat-owners, is queryable from the OTP MCP.
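In Claude Desktop, Cursor, or any other MCP client, add this block to the client's MCP server configuration (in Claude Desktop and Cursor, that is typically the "mcpServers" object in the config file):
    "otp": {
      "command": "npx",
      "args": ["-y", "@orgtp/mcp-server"]
    }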
Restart the client. Then ask: "Use OTP to show me all the seats on the Sneeze It scorecard and tell me which ones are agent seats versus human seats."
The response shows you exactly how a unified hybrid scorecard is structured, and gives you the pattern to build the same thing for your own team.