Coordination Intelligence

Agent Roles and Authority

104 claims from 31 organizations

Definitions of what each agent owns and does not own. Clear role boundaries prevent overlap, blame diffusion, and tuning conflicts. The most common coordination failure is two agents trying to do the same job.

Acme Digital Agency Founding gold
C006 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Each agent has a written role statement and a list of authorized actions.

Why: Without boundaries, agents overlap.

Failure mode: Two agents update the same project status.

C007 HIGH MEASURED RESULT 10x efficiency

No agent modifies another agent's shared state file.

Why: Single-writer prevents data races.

Failure mode: Two agents write to the same file. One overwrites the other.
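
A minimal sketch of the single-writer rule, assuming one JSON state file per agent (the file layout and class name are illustrative, not the actual implementation):

```python
import json
from pathlib import Path

STATE_DIR = Path("state")  # hypothetical layout: one JSON file per agent

class AgentState:
    """Single-writer pattern: an agent writes only its own file
    but may read any other agent's file."""

    def __init__(self, agent_name: str):
        self.own_file = STATE_DIR / f"{agent_name}.json"

    def write(self, state: dict) -> None:
        # The only write path targets the agent's own file, so
        # cross-agent writes have no code path at all.
        STATE_DIR.mkdir(exist_ok=True)
        self.own_file.write_text(json.dumps(state, indent=2))

    def read(self, other_agent: str) -> dict:
        path = STATE_DIR / f"{other_agent}.json"
        return json.loads(path.read_text()) if path.exists() else {}
```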

C004 HIGH OBSERVED ONCE 5x efficiency

Claude's analysis agents own the "what happened" and "what the data says." GPT's creative agents own "how to say it." Neither encroaches on the other's domain.

Why: We initially gave GPT's ad copy agent access to raw Meta Ads data so it could "write data-informed copy." It started including performance claims in ad text: "Our campaigns deliver 3x ROAS" -- pulling the number from another client's data. The model conflated data access with permission to cite. Removing raw data access and providing only structured briefs from Claude's analysis agents eliminated this class of error.

Failure mode: Creative agents with data access cite performance numbers that may be from wrong accounts, outdated, or confidential. Data access is conflated with citation authority.

C005 HIGH OBSERVED ONCE 5x efficiency

The client health scoring agent flags risk. The account manager owns the relationship response. The agent never suggests what to say to the client.

Why: The health scoring agent flagged a fitness franchise client as "high churn risk: CPL up 28%, no call scheduled in 3 weeks, last email unanswered." It also suggested: "Consider offering a rate reduction to retain." The account manager forwarded the agent's recommendation verbatim in an internal thread. The founder saw it and asked why we'd offer a discount to a client who hadn't complained. The client wasn't actually unhappy -- they were at a fitness industry conference and had told the AM they'd be offline for 2 weeks. The agent didn't know about the conference. The AM did but deferred to the agent's scoring.

Failure mode: Agents lack offline context (conversations, events, relationships). Recommendations based on data alone miss human context. Team members defer to agent recommendations instead of applying their own knowledge.

C004 HIGH OBSERVED REPEATEDLY 7x efficiency

The lead distribution agent assigns new leads within 5 minutes of capture. If the target location's CRM queue has more than 15 uncontacted leads, the lead is escalated to the regional manager instead.

Why: Speed to lead is the single highest-converting factor. But dumping leads into an already-overwhelmed queue wastes ad spend.

Failure mode: Before the overflow rule, the Phoenix location received 43 leads in one week during a promo push. Staff contacted only 18. The other 25 went cold. Cost per contacted lead effectively doubled from $34 to $68.
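
A sketch of the overflow rule, using an in-memory stand-in for the CRM (the class and method names are hypothetical):

```python
from dataclasses import dataclass, field

UNCONTACTED_THRESHOLD = 15  # overflow limit from the rule above

@dataclass
class Lead:
    name: str
    location: str

@dataclass
class InMemoryCRM:  # stand-in for the real CRM client
    queues: dict = field(default_factory=dict)
    escalations: list = field(default_factory=list)

    def uncontacted_count(self, location: str) -> int:
        return len(self.queues.get(location, []))

    def assign(self, lead: Lead) -> str:
        self.queues.setdefault(lead.location, []).append(lead)
        return "assigned"

    def escalate(self, lead: Lead) -> str:
        self.escalations.append(lead)
        return "escalated_to_regional_manager"

def route_lead(lead: Lead, crm: InMemoryCRM) -> str:
    # Assign within the 5-minute SLA unless the target queue is
    # saturated; saturated queues go to the regional manager instead.
    if crm.uncontacted_count(lead.location) > UNCONTACTED_THRESHOLD:
        return crm.escalate(lead)
    return crm.assign(lead)
```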

C005 HIGH OBSERVED ONCE 5x efficiency

The review monitoring agent drafts responses to all Google and Yelp reviews within 4 hours. Positive reviews (4-5 stars) use templated responses with location-specific personalization. Negative reviews (1-3 stars) are escalated to the franchisee with a draft response and talking points.

Why: Review response time correlates with overall rating improvement. But a bad response to a negative review causes more damage than no response at all.

Failure mode: Early version auto-posted a response to a 1-star review that said "We're sorry you had a bad experience! Our trainers are the best in Phoenix." The reviewer had complained about a specific trainer by name. The response sounded dismissive and tone-deaf. The reviewer updated with a screenshot of the response and said "This is clearly a bot." 14 additional negative comments followed.
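
A sketch of the rating split; the draft text and talking points are illustrative placeholders:

```python
def triage_review(rating: int, location: str) -> dict:
    """Positive reviews (4-5 stars) get a templated, location-personalized
    draft; negative reviews (1-3) are escalated and never auto-posted."""
    if rating >= 4:
        return {"action": "post_templated_response",
                "draft": f"Thanks for the kind words about our {location} team!"}
    return {"action": "escalate_to_franchisee",
            "draft": "We're sorry to hear this -- we'd like to make it right.",
            "talking_points": ["acknowledge the specific complaint",
                               "avoid generic praise of the staff",
                               "offer a direct contact"]}
```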

C006 HIGH OBSERVED ONCE 5x efficiency

The franchisee reporting agent delivers weekly performance packets by Sunday 6 PM local time. Each packet contains only that franchisee's data plus anonymized benchmarks against the network average.

Why: Franchisees who see specific competitor location data use it as ammunition in franchise disputes. Anonymized benchmarks motivate without creating conflict.

Failure mode: An early report template accidentally included location names in the benchmark column headers. Two franchisees called corporate within an hour, one demanding to know why her location was "being compared unfavorably."

C003 HIGH OBSERVED REPEATEDLY 7x efficiency

The intake/scoping agent collects project requirements into a structured brief but never quotes price or timeline. Marcus reviews every brief before it becomes a proposal.

Why: Creative project scoping requires reading between the lines of what clients say they want versus what they actually need. The agent captures the literal request; Marcus interprets the real one.

Failure mode: Early version auto-suggested a $6,500 quote for a project that Marcus would have scoped at $14,000 based on the client's brand complexity. Caught before sending, but would have left $7,500 on the table.

C004 HIGH OBSERVED ONCE 5x efficiency

Shot list agent generates reference-based suggestions organized by scene, angle, and mood. It does not specify exact compositions or color palettes.

Why: Compositions and palettes are where the creative team's expertise lives. Suggesting them feels like the agent is doing their job.

Failure mode: First version included color hex codes in shot list suggestions. Lead designer deleted the entire output and refused to use the tool for 3 weeks. Adoption stalled until the output was stripped down to structural references only.

C005 HIGH OBSERVED REPEATEDLY 7x efficiency

Revision tracking agent logs every client change request with timestamp, requester name, and exact quoted language. It never paraphrases client feedback.

Why: Creative revision disputes ("I never asked for that") are the #1 source of scope creep. Exact quotes are evidence.

Failure mode: Agent paraphrased a client note as "make it more modern" when the client actually said "can we try a different font for the lower thirds?" The team redesigned the entire motion template instead of swapping one font. 6 hours wasted.

C006 HIGH OBSERVED REPEATEDLY 7x efficiency

Invoice/timeline agent tracks milestones against contracted deliverables and flags when a project crosses 80% of estimated hours with deliverables remaining.

Why: Creative projects bleed hours invisibly. By the time Marcus noticed overruns manually, they were already 120-140% of estimate.

Failure mode: Before the 80% alert, a brand video project hit 160% of hours before anyone flagged it. $4,800 in unbillable time. Marcus ate the cost to preserve the relationship.
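
The 80% alert reduces to a simple predicate; a sketch with the threshold taken from the rule above:

```python
ALERT_THRESHOLD = 0.80  # flag at 80% of estimated hours

def should_alert(hours_logged: float, hours_estimated: float,
                 deliverables_remaining: int) -> bool:
    # Alert only while work is still open; a project at 80% burn
    # with everything shipped is simply nearly done.
    if hours_estimated <= 0:
        return False
    burn = hours_logged / hours_estimated
    return burn >= ALERT_THRESHOLD and deliverables_remaining > 0

# e.g. should_alert(96, 120, 2) -> True (80% burned, 2 deliverables open)
```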

Atticus Legal bronze
C005 HIGH OBSERVED ONCE 5x efficiency

The intake agent processes questionnaires. The assembly agent builds documents from templates. The scheduling agent manages follow-ups. No overlap. The intake agent does not assemble documents. The assembly agent does not interpret client wishes.

Why: In week 3, the assembly agent started referencing raw questionnaire responses to "improve" its field substitutions. It replaced "Trustee: [TBD]" with a name it extracted from the questionnaire response "My sister Linda could probably handle it." That was a casual comment, not a legal appointment. Priya caught it, but only because she knew the client was still deciding.

Failure mode: Assembly agent interprets a questionnaire comment as a legal instruction. Trust names a trustee the client never formally selected. Client signs without noticing. Sister Linda is notified of her appointment as trustee. Family conflict ensues when it turns out the client had promised the role to her brother.

C006 MEDIUM HUMAN DEFINED RULE 3x efficiency

The scheduling agent sends follow-up reminders for document review, signing appointments, and annual trust reviews. It does not include case-specific details in reminder messages. Reminders say "your estate planning appointment" not "your trust amendment review."

Why: Estate planning is private. A reminder email visible on a shared family computer that says "review of your updated beneficiary designations" could alert a family member that they have been added or removed as a beneficiary before the client is ready to discuss it.

Failure mode: Scheduling reminder says "your trust amendment to add your new spouse." Client's adult children from the first marriage see the email on a shared iPad. Family conflict erupts before the client has had the conversation she was planning. Client blames the attorney's office for the disclosure.

C007 MEDIUM HUMAN DEFINED RULE 3x efficiency

The intake agent classifies client complexity as STANDARD (templates cover the situation), MODERATE (templates with customization), or COMPLEX (requires Priya's direct drafting). COMPLEX cases skip document assembly entirely.

Why: A client with assets in three states, a special needs child, and a family LLC is not a template case. The assembly agent cannot handle irrevocable special needs trusts, multi-state property titling, or LLC operating agreement integration. Forcing these through templates creates a false sense of completeness.

Failure mode: COMPLEX case pushed through template assembly. Trust does not account for state-specific Medicaid lookback rules for the special needs beneficiary. If Medicaid ever reviews the trust, the child could lose benefits. The error might not surface for years, long after the grantor has passed.
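
A sketch of the complexity gate; the queue names are illustrative:

```python
from enum import Enum

class Complexity(Enum):
    STANDARD = "templates cover the situation"
    MODERATE = "templates with customization"
    COMPLEX = "requires direct attorney drafting"

def route_matter(complexity: Complexity) -> str:
    # COMPLEX matters never enter template assembly at all.
    if complexity is Complexity.COMPLEX:
        return "queue_for_attorney_drafting"
    return "queue_for_document_assembly"
```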

C005 HIGH OBSERVED ONCE 5x efficiency

Progress tracking agent compiles tutor session notes into structured summaries but never interprets or evaluates student performance. Interpretation is Keisha's job.

Why: Tutors provide raw observations ("worked on fractions, got 7/10 correct on practice set"). The agent organizes these. Only Keisha translates raw data into parent-facing assessments.

Failure mode: Agent added editorial commentary: "Student appears to be plateauing in multiplication fluency." Keisha sent the report without catching the added commentary. Parent called asking what "plateauing" meant and whether they should be concerned. Keisha hadn't said anything about plateauing because the data didn't support it -- the student had one bad week.

C006 MEDIUM OBSERVED ONCE 3x efficiency

Parent communication agent drafts messages in Keisha's voice and tone. It never signs messages or adds closings that Keisha wouldn't use. No "Best regards," no "Warm wishes." Keisha signs off with "- Keisha" and nothing else.

Why: Parents know Keisha's communication style. Anything different signals automation.

Failure mode: Agent drafted a progress update ending with "Warm regards, Brightpath Academy Team." A parent replied: "Team? I thought it was just you, Keisha. Are you growing?" This forced Keisha into an awkward conversation about her operations.

C007 HIGH OBSERVED ONCE 5x efficiency

Scheduling agent manages tutor availability and student slot assignments but cannot cancel or reschedule without Keisha's approval. It can propose changes but not execute them.

Why: Schedule changes affect families' routines. A wrong cancellation means a child shows up to an empty room (Keisha rents space at a community center).

Failure mode: Scheduling agent auto-cancelled a Tuesday session because the tutor marked "unavailable" for a dentist appointment. The parent wasn't notified. The student and parent showed up to an empty room. The parent texted Keisha a photo of her child standing in an empty hallway. Keisha cried.

C008 HIGH OBSERVED ONCE 5x efficiency

Scheduling agent checks for tutor-student continuity. The same tutor should work with the same student unless explicitly reassigned by Keisha.

Why: Students build relationships with their tutors. Unexpected tutor changes cause anxiety, especially for younger students.

Failure mode: Agent optimized the schedule for "efficiency" and swapped 4 tutor-student pairings in a single week. Three parents called asking why their child had a different tutor. One 2nd grader refused to participate with the new tutor. Keisha spent an entire evening reassigning everyone back.

Candor Labs bronze
C003 HIGH OBSERVED REPEATEDLY 7x efficiency

The support triage agent categorizes issues (bug, feature request, question, duplicate) and assigns priority (P0-P3). It does not suggest fixes, workarounds, or implementation approaches.

Why: See C001. The support agent's "suggested fix" for the connection pool race condition was plausible but wrong. The deeper problem: once an agent provides code-level suggestions, the founder unconsciously treats the triage agent as a second code reviewer. But the triage agent has no access to the test suite, no awareness of the dependency graph, and no context about recent architectural decisions. Its suggestions are informed guesses, not engineering analysis.

Failure mode: Authority creep turns a triage agent into an unlicensed advisor. The human treats convenience as competence.

C004 HIGH HUMAN DEFINED RULE 5x efficiency

The code review agent flags issues in PRs. It does not open PRs, push commits, create branches, or modify any code.

Why: Founding rule. The founder read about an AI code review tool that auto-fixed style violations by pushing commits to open PRs. A contributor's PR received 14 auto-fix commits that broke the contributor's code. The contributor abandoned the PR and left a comment: "I'll come back when your robot stops rewriting my code." For an open-source-adjacent dev tool at $12K MRR, contributor goodwill is existential.

Failure mode: Agents with write access to the codebase modify contributor work without consent. Contributors feel overridden. Open source community trust -- which feeds the commercial product's growth -- erodes.

C004 HIGH OBSERVED ONCE 5x efficiency

Lens (client research) and Recon (competitive intel) must operate with separate context windows. Lens has access to client engagement data. Recon has access to public market data and competitive positioning. They never share client-specific intelligence.

Why: The highest-risk information leak vector is the intersection of competitive intelligence and client research. If Recon knows Client A's strategy and is asked to analyze Client A's competitor, it could inadvertently reveal privileged information.

Failure mode: Before the firewall, a single research agent held context from a Meridian Corp strategy engagement while simultaneously running competitive analysis on Meridian's rival. The competitive brief included insights that could only have come from Meridian's internal data. Caught internally, but it triggered the architectural split.

C005 HIGH OBSERVED REPEATEDLY 7x efficiency

Archer (proposals) must pull methodology descriptions and case studies exclusively from Vault's approved library. It must not generate novel methodologies or fabricate case study details.

Why: Proposals create contractual expectations. If Archer invents a "proprietary framework" that doesn't exist, we're committed to delivering something we haven't built.

Failure mode: Archer generated a proposal referencing Clearpoint's "4D Transformation Framework" -- which does not exist. The client was excited about it. We had to either build it on the fly or admit the proposal was inaccurate. We built it, at a cost of 40 unplanned hours.

C006 HIGH OBSERVED REPEATEDLY 7x efficiency

Brief (meeting prep) must include a "relationship context" section sourced from Accelo activity history and Slack threads. Every client meeting prep includes: last 3 interactions, open action items, any unresolved complaints or concerns, and days since last contact.

Why: Consultants juggling 4-6 active engagements forget context. Walking into a meeting without knowing the client raised a billing concern last week is a relationship risk.

Failure mode: Lead consultant walked into a quarterly review unaware that the client's CFO had emailed about a $12,000 billing discrepancy two days prior. The CFO raised it in the meeting. The consultant was blindsided and the client questioned whether anyone was reading their emails.

C007 HIGH OBSERVED REPEATEDLY 7x efficiency

Ledger must flag any invoice where the total exceeds the contracted engagement cap or the monthly retainer amount by more than 5%. These invoices require partner approval before sending.

Why: Overbilling erodes trust faster than underdelivering. Clients on fixed-fee or capped engagements track their spend carefully.

Failure mode: Ledger generated and queued an invoice for $47,200 against a $40,000 engagement cap. The overage was legitimate change-order work, but the invoice went out without the change order documentation attached. Client disputed the entire invoice, delaying payment by 6 weeks.
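
The approval gate reduces to a one-line check; a sketch using the 5% tolerance and the figures from the incident above:

```python
OVERAGE_TOLERANCE = 0.05  # 5% over the cap or retainer triggers review

def needs_partner_approval(invoice_total: float, contracted_cap: float) -> bool:
    # Hold the invoice for partner approval instead of sending it.
    return invoice_total > contracted_cap * (1 + OVERAGE_TOLERANCE)

# e.g. needs_partner_approval(47_200, 40_000) -> True (the incident above)
```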

C003 HIGH OBSERVED ONCE 5x efficiency

The class scheduling agent owns schedule optimization recommendations but never modifies the Mindbody schedule directly. All changes go through location managers.

Why: Class schedules affect trainer pay, member routines, and room availability. A single bad swap cascades into staffing conflicts.

Failure mode: During a test phase, the scheduling agent auto-swapped a 6 AM yoga class to 6:30 AM based on attendance data. The 6 AM regulars -- 8 members who come before work -- showed up to a locked studio. Two cancelled memberships that week.

C004 HIGH OBSERVED ONCE 5x efficiency

The trainer performance agent reports to Jamie only. No trainer sees their own performance data unless Jamie shares it in a 1-on-1 context.

Why: Raw performance metrics without coaching context feel like surveillance. Trainers need to trust the system, not fear it.

Failure mode: Early version CC'd a trainer on a weekly performance summary showing their class had the lowest average attendance. The trainer confronted Jamie in front of other staff. Took 3 weeks to rebuild trust.

DevForge silver
C004 HIGH OBSERVED ONCE 5x efficiency

Issue triage agent labels and categorizes GitHub issues but never comments, closes, or assigns. It writes triage summaries to a private Linear board that Kai reviews each morning.

Why: A wrong label on a GitHub issue is embarrassing but survivable. A wrong comment or premature close alienates a contributor.

Failure mode: Early version auto-commented "This looks like a duplicate of #247" on a new issue. The issue was not a duplicate. The reporter replied: "Did you even read my issue? This is completely different." Kai apologized publicly and removed the auto-comment feature permanently. The contributor submitted 0 PRs after the incident (was previously averaging 2/month).

C005 HIGH OBSERVED ONCE 5x efficiency

PR review prep agent generates a review checklist (test coverage, breaking changes, docs impact, code style) but does not post review comments. Kai uses the checklist to write his own review.

Why: Code review is where maintainer judgment matters most. Contributors can tell the difference between a thoughtful review and a checklist dump.

Failure mode: Agent generated a review comment that said "Consider adding tests for edge cases." The contributor replied: "Which edge cases? This is the kind of generic feedback I get from AI code review tools." The comment was attributed to Kai. He lost credibility with that contributor, who was a top-5 contributor by commit volume.

C006 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Docs generation agent creates draft documentation from code changes and PR descriptions. All generated docs go through Kai's review before merging to the docs site.

Why: Developer documentation reflects the maintainer's mental model. Generated docs that don't match Kai's explanatory style confuse users who learned from his existing docs.

Failure mode: Agent generated API documentation that was technically correct but used different terminology than the rest of the docs. The existing docs called a concept "middleware hooks." The generated docs called the same concept "request interceptors." 3 users filed issues asking if these were different features. Kai spent 2 hours clarifying and standardizing terminology.

C007 MEDIUM OBSERVED ONCE 3x efficiency

Release notes agent compiles changes from merged PRs, categorizes them (breaking, feature, fix, internal), and drafts release notes. Kai edits for voice and publishes manually.

Why: Release notes are the primary communication channel with enterprise customers ($8K MRR depends on these customers understanding what changed).

Failure mode: Agent listed a dependency update as a "breaking change" because the dependency's major version bumped. In reality, DevForge's usage of the dependency was unaffected. 3 enterprise customers emailed asking about migration steps for a non-existent breaking change. Kai spent 4 hours on email clarifications and had to publish a correction notice.

C004 HIGH OBSERVED ONCE 5x efficiency

The transaction categorization QA agent works with anonymized category distributions only. It receives "Category X had 340 transactions this week, 12 flagged as miscategorized by users" -- never the transactions themselves.

Why: The QA agent's job is to detect systematic categorization drift, not to re-categorize individual transactions. Aggregate signals are sufficient and SOC 2 compliant.

Failure mode: An early version requested sample transactions to "understand the miscategorization pattern." The request was caught by the data pipeline team before execution. Had it proceeded, it would have pulled raw merchant data into the agent prompt.

C005 HIGH OBSERVED ONCE 5x efficiency

The churn prediction agent produces risk scores at the cohort level (e.g., "Users who connected 1 account and have not logged in for 14+ days: 62% churn probability"). Individual user-level predictions are generated but stored only in the internal analytics database, never surfaced in agent output.

Why: Acting on individual predictions without human review creates false positive outreach. A user flagged as "churning" who is actually just on vacation receives a tone-deaf "We miss you" email.

Failure mode: The churn agent identified 45 users with a >70% churn probability and recommended an immediate re-engagement email. Raj approved the batch. 8 users replied angrily -- they were active users who had simply switched to the mobile app (which the churn model didn't track at the time). Greenline lost 2 users who felt surveilled.

C006 HIGH OBSERVED ONCE 5x efficiency

The API monitoring agent has autonomous authority to post P1 alerts (Plaid down, Stripe down, database connection failures) to the #engineering-alerts Slack channel without human approval. P2 and P3 alerts require engineering lead confirmation.

Why: Plaid outages directly impact user experience -- bank connections fail, balances don't update, transactions are missing. Every minute of delay in alerting the engineering team extends user impact.

Failure mode: Before autonomous P1 alerts, a Plaid outage at 11:47 PM went unnoticed for 3 hours because the on-call engineer was asleep and the agent waited for confirmation. 412 users experienced failed bank syncs. 23 submitted support tickets. NPS dropped 4 points that month.
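
A sketch of the severity split, with stub Slack and on-call clients standing in for the real integrations:

```python
class SlackStub:
    def post(self, channel: str, message: str) -> None:
        print(f"[{channel}] {message}")

class OncallStub:
    def request_confirmation(self, severity: str, message: str) -> None:
        print(f"awaiting {severity} confirmation: {message}")

def dispatch_alert(severity: str, message: str,
                   slack: SlackStub, oncall: OncallStub) -> str:
    # P1 (Plaid down, Stripe down, DB connection failures): post
    # to #engineering-alerts with no human gate.
    if severity == "P1":
        slack.post("#engineering-alerts", message)
        return "posted"
    # P2/P3: the engineering lead confirms before anything is posted.
    oncall.request_confirmation(severity, message)
    return "pending_confirmation"
```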

C005 HIGH OBSERVED REPEATEDLY 7x efficiency

The intake agent screens and classifies. The demand agent drafts letters. The deadline agent monitors dates. The comms agent schedules client updates. No agent crosses into another agent's domain.

Why: In week 5, the demand letter agent started including deadline references in its drafts ("this demand must be resolved before the statute expires on [date]"). The deadline agent also flagged the same case in its morning report. Attorney received the same deadline from two sources with different urgency framing. One said 45 days, the other said 43 days (because they calculated from different reference points).

Failure mode: Two agents report conflicting deadline information. Attorney trusts the more optimistic number. Paralegal trusts the other. Neither confirms with Clio directly because both assume the agents already checked. Confusion wastes 2 hours of billable time at $350/hr.

C006 MEDIUM HUMAN DEFINED RULE 3x efficiency

The comms scheduling agent arranges client update calls but never generates the content of those updates. It schedules the call and provides the attorney with a case summary from Clio. The attorney delivers the update.

Why: Client communications in a PI firm carry legal weight. An imprecise statement about case progress ("we expect to settle by August") can create client expectations that become the basis for a grievance if not met. Only attorneys make representations about case outcomes.

Failure mode: Comms agent sends a scheduling email that includes "we have good news about your case." Client interprets this as a settlement offer. Arrives at the call expecting a check. Attorney was actually calling to discuss a strategy change. Client feels misled.

C007 HIGH HUMAN DEFINED RULE 5x efficiency

The demand letter agent has read-only access to Clio case data. It cannot update case notes, modify statuses, or add documents. Only attorneys and paralegals write to the case file.

Why: In week 3, we considered giving the demand agent write access so it could log its drafts directly to Clio. Ethics counsel vetoed it. AI-generated content in the case file without attorney review could be discoverable as attorney work product with no attorney actually having reviewed it.

Failure mode: Agent writes a draft demand letter directly to Clio. Opposing counsel subpoenas the case file. Draft contains preliminary damage calculations the firm never intended to disclose. Numbers become a negotiation anchor that works against the client.

C005 MEDIUM OBSERVED ONCE 3x efficiency

The listing agent writes descriptions. The comp agent analyzes market data. The qualifier handles leads. The scheduler optimizes showings. The reporter generates seller updates. No agent crosses domain boundaries.

Why: In week 6, the listing agent started incorporating comp data into descriptions ("priced 8% below comparable homes in the area"). This combined two agent domains without review from either the comp agent's accuracy check or the listing agent's compliance check. The 8% figure was based on a 30-day-old comp pull and was no longer accurate.

Failure mode: Listing description includes stale comp data. Buyer's agent challenges the claim during negotiation. Seller's agent cannot substantiate the number. Credibility damaged. Negotiation shifts in buyer's favor. Seller nets $12K less than expected and blames the brokerage.

C006 MEDIUM OBSERVED ONCE 3x efficiency

The showing scheduler reads agent calendars and property availability windows. It suggests optimal showing routes but cannot confirm, cancel, or modify appointments. Agents confirm showings with clients directly.

Why: The scheduler once confirmed a same-day showing at a property where the seller had requested 24-hour notice, based solely on a gap in the calendar. The seller was home and unprepared, complained to the listing agent, and the relationship was strained.

Failure mode: Scheduler auto-confirms a showing without respecting the seller's notice requirement. Buyer's agent shows up. Seller is in pajamas. Seller calls Rachel directly. Listing agent loses the seller's trust. Seller asks to cancel the listing. Lost commission: $14,550 (3% of $485K).

C007 HIGH OBSERVED ONCE 5x efficiency

The weekly seller report agent provides factual updates: showing count, feedback summaries, days on market, web traffic, and market movement. It does not recommend price changes, staging, or marketing strategy. Those recommendations come from the listing agent.

Why: A seller report included the line "Based on current market trends, a 3% price reduction may accelerate the sale timeline." The seller read this as the brokerage's recommendation and reduced the price without consulting her agent. Agent had been planning to recommend staging first, not a price reduction. Seller lost approximately $14,500 in potential equity.

Failure mode: AI-generated price recommendation conflicts with the listing agent's strategy. Seller follows the AI's recommendation over the agent's because it appeared in an "official" report. Agent's strategy is undermined. Seller potentially leaves money on the table.

KGORG Founding silver
C002 HIGH HUMAN DEFINED RULE 5x efficiency

Alfred functions as the strategic command layer, translating ambiguity into recommendations, priorities, plans, and delegated work.

Why: The system needs one seat that owns framing, prioritization, sequencing, and organizational judgment rather than forcing the human user to route everything manually.

Failure mode: Without a clear strategic coordinator, requests remain ambiguous, decisions get deferred back to the user, and the organization becomes a collection of disconnected specialists.

C003 HIGH HUMAN DEFINED RULE 5x efficiency

Pepper functions as the operational right hand, focusing on follow-through, task structure, briefing prep, and logistics-oriented coordination rather than primary strategic judgment.

Why: KGORG separates strategic command from operational execution support so the system can maintain continuity and momentum without role confusion.

Failure mode: If the Executive Assistant becomes a second Chief of Staff, work ownership blurs and the user loses the benefit of a clean strategic versus operational split.

C004 HIGH HUMAN DEFINED RULE 5x efficiency

Sophie owns Gmail and Google Calendar execution workflows, with strong read-versus-write separation and confirmation-first behavior for outbound or modifying actions.

Why: Email and calendar operations benefit from a specialized worker with narrow tools, structured outputs, and explicit safety boundaries.

Failure mode: A generalist agent could mishandle dates, recipients, or event changes, causing missed meetings, accidental sends, or inbox mistakes.

C005 HIGH OBSERVED ONCE 5x efficiency

Flow (scheduling) must work exclusively with de-identified, aggregated data: appointment counts by hour, no-show rates by day of week, utilization percentages by clinic, average visit duration by treatment category (not by patient). Flow must never receive individual appointment records.

Why: Individual appointment records, even without names, contain temporal patterns (same time, same day, same duration) that can identify regular patients. Aggregated data eliminates this risk while still providing the statistical basis for scheduling optimization.

Failure mode: Flow received individual appointment records (time, duration, treatment category, clinic) with names stripped. Flow's analysis referenced "the recurring 45-minute appointment at Buckhead every Monday at 9 AM for 16 weeks." Any staff member at Buckhead would know which patient that is. The data pipeline was restructured to aggregate before export.

C006 HIGH OBSERVED ONCE 5x efficiency

Shield (insurance verification) must work with payer patterns and denial statistics, never with individual patient insurance records. Shield identifies common denial triggers by payer (e.g., "Blue Cross denies PT visits after 20 sessions without re-authorization 68% of the time") and prepares verification checklists. The front desk team applies these checklists to individual patients.

Why: Insurance records contain member IDs, group numbers, diagnosis codes, and treatment histories -- all PHI. Shield's value comes from pattern recognition across payers, not from processing individual claims.

Failure mode: Before the pattern-only rule, Shield was given individual insurance verification records to "check for common issues." Shield's output included: "Patient #4472's Blue Cross plan requires pre-auth for visits 21+, currently at visit 18." The patient number, combined with the visit count and payer, is enough to identify the patient in the practice management system. This is PHI. The rule was changed to pattern-level data only.

C007 HIGH OBSERVED ONCE 5x efficiency

Beacon (marketing) must never use real patient testimonials, real outcome data, or real before/after descriptions without explicit written HIPAA authorization from the patient (a separate authorization from the treatment consent). Beacon may use aggregated outcome statistics ("92% of our patients report reduced pain within 6 visits") that cannot identify individuals.

Why: Patient testimonials require a specific HIPAA authorization that is separate from the general treatment consent. Many practices assume the consent covers marketing use -- it does not. An unauthorized testimonial is a HIPAA violation even if the patient verbally agreed.

Failure mode: The marketing coordinator asked Beacon to draft a testimonial based on a patient who had verbally said "you can use my story." No written HIPAA authorization was obtained. Beacon drafted the testimonial. The clinical reviewer (a licensed PT) caught the missing authorization and halted publication. Had it been published, it would have been an unauthorized disclosure of PHI, reportable to HHS.

C008 MEDIUM OBSERVED ONCE 3x efficiency

Grid (staff scheduling) must account for certification requirements by location. The Alpharetta clinic requires at least one PT with dry needling certification during all operating hours. The Buckhead clinic requires at least one PT with sports medicine specialization for the 6 AM - 9 AM athlete block. Grid must flag any proposed schedule that violates certification minimums.

Why: A clinic operating without the required specialist certifications during designated hours is both a patient safety risk and a regulatory compliance issue. Georgia state PT regulations require appropriate credentialing for specialized treatments.

Failure mode: Grid proposed a Thursday schedule that moved the only dry-needling-certified PT from Alpharetta to Roswell to cover a call-out. Three Alpharetta patients had dry needling appointments that Thursday. The front desk called to reschedule, but one patient had taken a half-day off work for the appointment. She was upset and left a 1-star Google review mentioning "constant scheduling changes." The clinic director now manually reviews all Grid proposals that move specialized PTs between locations.
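
A sketch of the certification check; the hour windows are illustrative stand-ins for each clinic's real operating schedule:

```python
REQUIREMENTS = {
    "Alpharetta": [("dry_needling", range(8, 19))],  # all operating hours
    "Buckhead": [("sports_medicine", range(6, 9))],  # 6-9 AM athlete block
}

def certification_violations(schedule: dict) -> list:
    """schedule maps (clinic, hour) -> set of certifications on shift.
    Returns every slot where a required certification is missing."""
    missing = []
    for clinic, rules in REQUIREMENTS.items():
        for cert, hours in rules:
            for hour in hours:
                if cert not in schedule.get((clinic, hour), set()):
                    missing.append((clinic, hour, cert))
    return missing
```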

C005 MEDIUM OBSERVED REPEATEDLY 4x efficiency

The triage agent classifies and routes. The comms agent communicates with tenants. The lease agent tracks renewals. The vendor agent coordinates repairs. No agent crosses into another's domain.

Why: In week 5, the comms agent started including triage-level language in its responses to tenants: "Your request has been classified as ROUTINE." Tenant interpreted ROUTINE as "not important" and complained that management did not take their broken garbage disposal seriously. The internal classification label was never meant for tenant-facing use.

Failure mode: Internal classification labels leak into tenant communications. Tenant sees "ROUTINE" on their maintenance request and feels dismissed. What is an operational efficiency tool becomes a customer service liability. Three tenants in one month referenced the "ROUTINE" label in complaints.

C006 MEDIUM OBSERVED ONCE 3x efficiency

The lease renewal agent surfaces data: lease end dates, current rent vs market rate, tenant payment history, maintenance request frequency, and lease violation history. It does not recommend renewal terms, rent increases, or non-renewal decisions.

Why: The renewal agent suggested a 6% rent increase for a tenant in Unit 4B based on market comparables. Corinne knew that tenant was dealing with a recent job loss and had been a reliable payer for 4 years. She offered a 2% increase with a 14-month term instead. The tenant renewed. A 6% increase would have driven her out. Turnover cost avoided: $3,200.

Failure mode: Agent recommends a market-rate increase without context about the tenant's situation. Property manager follows the recommendation without applying human judgment. Good tenant leaves. Unit sits vacant for 23 days (Richmond average). Lost rent plus turn costs exceed the annual revenue gained from the rent increase.

C007 HIGH OBSERVED ONCE 5x efficiency

The vendor coordination agent maintains a pre-approved vendor list with rate caps for each service type. It dispatches from this list only. New vendors require Corinne's approval before being added to the list.

Why: In week 7, a pre-approved electrician was unavailable. The vendor agent searched for an alternative and dispatched a contractor Ray had previously flagged as unreliable (slow, overcharges, does not clean up). The contractor billed $680 for a $300 job. Because the agent had no "blacklist," it treated any licensed contractor as acceptable.

Failure mode: Agent dispatches an unlisted vendor. Vendor overcharges by $380. Work quality is poor. Tenant complains about the quality. Corinne has to send a second vendor to redo the work. Total cost: $680 (first vendor) + $300 (second vendor) = $980 for a $300 job.

Learnwell silver
C005 HIGH OBSERVED ONCE 5x efficiency

Content QA agent has the authority to pull published content from public access if it identifies a factual error, without waiting for human approval. It cannot publish -- only unpublish.

Why: Speed of removal matters more than speed of correction. A wrong study guide that's accessible for 6 hours does less damage than one accessible for 48 hours while waiting for human review.

Failure mode: Before this authority was granted, the QA agent flagged an error in a biology study guide on a Friday evening. No human reviewed it until Monday morning. The guide was accessed 420 times over the weekend. 3 students emailed about the error. The Monday morning fix felt reactive rather than proactive.

C006 HIGH OBSERVED ONCE 5x efficiency

Onboarding optimization agent tests welcome flow variations (email subject lines, first-login experience, tutorial sequencing) but cannot change pricing, trial length, or feature access. Those are founder decisions.

Why: Onboarding experiments affect first impressions. Pricing and access changes affect revenue and positioning. Different risk profiles, different approval requirements.

Failure mode: Onboarding agent A/B tested a "14-day free trial" variant against the standard "7-day trial" without approval. The 14-day variant converted 22% better but cannibalized $2,100 in revenue over a month because students waited longer to convert. Priya didn't discover it until the monthly revenue review.

C007 MEDIUM OBSERVED ONCE 3x efficiency

Engagement metrics agent reports data and trends. It does not recommend product changes. Product decisions are made by the founders using the data.

Why: The engagement agent once recommended "gamification badges" based on engagement patterns. Priya and Sam spent 2 weeks building badges before realizing the engagement drop was caused by a broken notification system, not lack of gamification.

Failure mode: 2 weeks of engineering time ($4,800 in developer cost) spent on badges that didn't move any metric. The notification fix (2 hours of work) recovered 85% of the engagement drop. Data without interpretation leads to wrong solutions.

C008 HIGH OBSERVED ONCE 5x efficiency

Teacher outreach agent drafts personalized emails to teachers but never sends them. Priya reviews every email because teacher relationships are the company's moat.

Why: Teachers talk to each other. One impersonal or tone-deaf email gets screenshotted in a teacher Facebook group with 40K members.

Failure mode: Agent drafted an outreach email that opened with "Dear Educator." A teacher replied: "My name is right there in your system. If you can't be bothered to use it, I can't be bothered to reply." Priya caught it before it went to 50 other teachers. If it had gone out, the damage to teacher network trust would have been severe.

McFadyen Digital Founding silver
C004 HIGH MEASURED RESULT 10x efficiency

The Proposal Engine (AI agent) drafts RFP responses and SOWs by pulling from our 250+ engagement library, matching past project patterns to incoming requirements. It generates a scored first draft with confidence ratings per section. A Solutions Architect must review and approve before it moves to the client.

Why: RFP response time is a competitive advantage. Our average was 12 days. The Proposal Engine cut it to 4 days with higher win rates because it surfaces relevant case studies automatically.

Failure mode: The engine once pulled a case study from a client under NDA as a reference in a proposal for their direct competitor. The Solutions Architect caught it. We now run a conflict-of-interest check as a hard gate before any case study inclusion.

C005 HIGH OBSERVED REPEATEDLY 7x efficiency

The Knowledge Navigator (AI agent) indexes all internal Confluence documentation, Slack conversations, and GitHub repositories. Employees query it in natural language. It returns answers with source citations. It never creates or modifies documentation -- read-only.

Why: With 240 people across 5 offices, institutional knowledge was trapped in individual heads and buried Confluence pages. New hires took 90 days to become productive. The Knowledge Navigator cut onboarding ramp to ~55 days.

Failure mode: The navigator surfaced an outdated Confluence page about our VTEX integration patterns that had not been updated after a major API change. A junior developer followed it, burned 3 days, and introduced a regression. We now tag documentation with a staleness score and the navigator warns when citing pages older than 6 months.

C006 HIGH OBSERVED REPEATEDLY 7x efficiency

The Delivery Monitor (AI agent) tracks all active Jira projects across delivery teams, flags velocity drops >20%, missed sprint commitments, and scope creep patterns. It reports to the SVP of Global Delivery daily. It does not reassign tasks, modify sprints, or communicate with clients.

Why: With 40+ concurrent engagements across timezones, delivery risk was invisible until it was too late. The SVP cannot review every standup note from every team.

Failure mode: When it flagged too aggressively during the early tuning period, PMs started ignoring alerts. We had to calibrate thresholds per project type -- a 20% velocity drop on a 6-month marketplace build means something different than on a 3-week integration sprint.
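
A sketch of per-type calibration; the threshold values are illustrative assumptions, not the tuned production numbers:

```python
THRESHOLDS = {
    "marketplace_build": 0.20,    # long engagements: keep the default
    "integration_sprint": 0.35,   # short sprints swing more week to week
}

def should_flag(project_type: str, prev_velocity: float,
                curr_velocity: float) -> bool:
    if prev_velocity <= 0:
        return False  # no baseline to compare against
    drop = (prev_velocity - curr_velocity) / prev_velocity
    return drop > THRESHOLDS.get(project_type, 0.20)
```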

C007 HIGH OBSERVED ONCE 5x efficiency

The Code Review Assistant (AI agent) performs first-pass code reviews on all PRs, checking for security vulnerabilities, platform-specific anti-patterns (Adobe Commerce, commercetools, VTEX), and adherence to our internal coding standards. It leaves inline comments. A senior developer must still approve the PR.

Why: Code review was the bottleneck in our delivery pipeline. Senior developers were spending 30% of their time reviewing junior code. The assistant handles the mechanical checks so senior devs can focus on architecture and logic.

Failure mode: The assistant's first pass cleared a PR that satisfied all mechanical checks but introduced a business logic error in marketplace commission calculations, paying sellers at the wrong tier. A senior developer would have caught the domain error. We now require business logic sign-off as a separate gate from code quality.

C008 MEDIUM MEASURED RESULT 6x efficiency

The Sales Intelligence Agent monitors HubSpot pipeline, enriches incoming leads with firmographic data, scores them against our ICP (B2B distributors/manufacturers with $50M+ revenue, existing marketplace aspirations), and routes qualified leads to the CRO's team with a priority score.

Why: Our CRO Ed Coke has sold over $1B in commerce services. His time should be spent on $500K+ opportunities, not qualifying $30K requests. The agent handles triage.

Failure mode: The scoring model initially weighted company size too heavily and deprioritized a mid-market chemical distributor that turned into our ChemDirect engagement -- one of our highest-profile marketplace launches. We added "marketplace intent signals" as a scoring factor.

C009 MEDIUM OBSERVED REPEATEDLY 4x efficiency

The Marketplace Analyst (AI agent) monitors live marketplace deployments for our managed services clients -- tracking seller onboarding velocity, GMV trends, catalog health, and commission anomalies. It generates weekly health reports for account managers. It does not modify marketplace configurations or contact sellers directly.

Why: Clients on our managed marketplace services expect proactive issue detection. A marketplace with degrading seller health metrics needs intervention before sellers churn, not after.

Failure mode: The analyst flagged a "GMV decline" that was actually a seasonal pattern (post-holiday normalization). The account manager escalated unnecessarily, alarming the client. We now require 4-week rolling comparisons against same-period prior year before flagging GMV declines.
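
A sketch of the seasonality guard, assuming a 15% drop threshold (illustrative) and weekly GMV keyed by (year, ISO week):

```python
def gmv_decline_flag(gmv: dict, year: int, week: int,
                     drop_threshold: float = 0.15) -> bool:
    """gmv maps (year, iso_week) -> weekly GMV. Flag only when the
    4-week rolling window trails the same window one year earlier."""
    window = range(week - 3, week + 1)
    current = sum(gmv.get((year, w), 0.0) for w in window)
    prior = sum(gmv.get((year - 1, w), 0.0) for w in window)
    if prior == 0:
        return False  # no prior-year basis; do not flag
    return (prior - current) / prior > drop_threshold
```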

C003 HIGH OBSERVED ONCE 5x efficiency

The ad monitoring agent reports anomalies. It does not diagnose causes or recommend actions.

Why: When the monitor started including "likely cause: audience fatigue" in its alerts, the account manager skipped her own analysis and forwarded the agent's guess to the client. The actual cause was a Meta algorithm change affecting all lead gen campaigns that week. The agent's guess was plausible but wrong. The client made creative changes they didn't need.

Failure mode: Agent speculation is treated as diagnosis. Humans skip their own investigation. Clients act on wrong information.

C004 HIGH OBSERVED REPEATEDLY 7x efficiency

Each agent has a named human owner who reviews its output weekly and owns its calibration.

Why: During our rapid scaling from 2 to 8 agents, agents 5 through 8 had no clear human reviewer. The pipeline tracking agent miscategorized "proposal viewed" as "deal progressing" for 3 weeks. Nobody caught it because nobody was looking. $42K in pipeline was listed as "warm" when it was actually stale proposals auto-opened by email preview panes.

Failure mode: Unreviewed agents drift. Miscalibrations compound silently. Bad data enters decision-making.

C016 MEDIUM OBSERVED ONCE 3x efficiency

The pipeline agent tracks deal stage transitions. It does not predict close probability or recommend next actions.

Why: When we added "recommended next step" to the pipeline agent's output, it suggested "send a follow-up proposal" for a deal where the prospect had explicitly asked for two weeks to think. The account manager sent the follow-up. The prospect replied: "I said two weeks." The deal still closed, but the relationship started on the wrong foot.

Failure mode: Agent recommendations override human judgment about relationship timing. Prospects feel pressured by automation-driven cadence.

C005 HIGH OBSERVED ONCE 5x efficiency

Scout must distinguish between verified facts (sourced from named publications, filings, or databases) and inferences (derived from pattern analysis or synthesis). Every claim in a research brief must be tagged [VERIFIED] or [INFERRED].

Why: The founder presents research findings to C-suite audiences who will challenge sources. An untagged inference presented as fact destroys credibility. Worse, the founder may not know which claims are inferred until challenged in a live meeting.

Failure mode: Scout produced a competitive analysis stating that a competitor "plans to enter the Southeast market in Q3." This was an inference from job postings and real estate filings, not a verified plan. The founder presented it as fact in a board meeting. The client's CEO had dinner with that competitor's CEO the following week and asked about the Southeast expansion. The competitor had no such plans. The founder lost standing with the board.
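
A sketch of the tagging rule, assuming a claim counts as [VERIFIED] only when it carries at least one named source:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    sources: list  # named publications, filings, or databases

    @property
    def tag(self) -> str:
        # Pattern-derived synthesis carries no named source,
        # so it can never masquerade as a verified fact.
        return "[VERIFIED]" if self.sources else "[INFERRED]"

    def render(self) -> str:
        return f"{self.tag} {self.text}"

# Claim("Competitor expanding to the Southeast", sources=[]).render()
# -> "[INFERRED] Competitor expanding to the Southeast"
```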

C006 HIGH OBSERVED ONCE 5x efficiency

Forge must produce deliverables at 70% completion. The remaining 30% must be clearly marked with [FOUNDER INPUT NEEDED] tags. Forge must never attempt to write the "insight" sections -- the strategic conclusions that the client is paying $350/hr for.

Why: Over-complete drafts trigger rubber-stamping behavior. The founder skims instead of reads, approves instead of edits, and the deliverable goes out without the founder's genuine strategic thinking. The client eventually notices they're getting generic strategy wrapped in a premium package.

Failure mode: Forge produced a 95%-complete strategy deck for a healthcare client. The founder made minor edits and sent it. The client's COO replied: "This reads like something I could get from any consultant. Where's the insight we're paying you for?" The founder realized Forge had written the recommendation section, and the founder had approved it without adding original thinking. Lost the renewal.

C007 HIGH OBSERVED REPEATEDLY 7x efficiency

Prep must include a "landmines" section in every meeting prep document: topics that could be sensitive, unresolved issues from prior interactions, and relationship dynamics the founder should be aware of.

Why: Solo consultants carry dozens of client relationships in their head. Context switching between 3-4 client meetings in a day means details slip. The landmines section is the most valuable part of the prep -- it prevents the founder from accidentally stepping into a known sensitive area.

Failure mode: Founder walked into a client meeting and casually asked about a project that the client had canceled two months prior due to budget cuts. The team lead who had championed the project (and fought internally for it) was visibly uncomfortable. Prep had included the project in the "current initiatives" section without noting the cancellation or the politics around it.

C008 HIGH OBSERVED ONCE 5x efficiency

Tempo must never reference specific deliverable content in follow-up emails. Follow-ups reference meetings and action items only. Deliverable details are shared through dedicated channels (email attachments, shared drives), not casual follow-up notes.

Why: Follow-up emails get forwarded. If Tempo includes a strategic recommendation in a follow-up, it may reach people who weren't in the room and lack the context to interpret it correctly.

Failure mode: Tempo drafted a follow-up that included "as discussed, we recommend divesting the Cleveland operation." The recipient forwarded the email to the full leadership team as a meeting summary. The Cleveland GM saw it before the client's CEO had a chance to frame the recommendation. Internal politics exploded. The founder spent 6 hours on damage control calls.

C006 MEDIUM INFERENCE 2x efficiency

Protocol Steward owns format spec, merge protocol, and architecture.

Why: Protocol needs a dedicated guardian for consistency.

Failure mode: Format quality drifts. Schema bloats.

C007 MEDIUM INFERENCE 2x efficiency

Market Intelligence owns competitive scanning and content drafting. Cannot send without approval.

Why: Market awareness must be continuous. External comms must be approved.

Failure mode: Competitive threats go undetected, or wrong messages reach prospects.

C008 MEDIUM HUMAN DEFINED RULE 3x efficiency

Revenue Analyst activates only when revenue exists (Phase 3).

Why: Nothing to track until revenue exists.

Failure mode: Premature activation produces meaningless reports.

C004 HIGH OBSERVED ONCE 5x efficiency

The client feedback synthesis agent organizes, categorizes, and groups client feedback by theme. It preserves exact client quotes in quotation marks and never paraphrases or interprets client intent.

Why: Client words carry nuance that paraphrasing destroys. "I don't love the blue" is different from "I hate the blue" is different from "Can we try other colors?" Each implies a different creative response.

Failure mode: The feedback agent summarized a client's feedback as "Client wants warmer colors." The actual quotes were "I like where this is going but the blue feels a bit corporate" and "Could we try something that feels more approachable?" Kai designed a warm-palette revision. The client said "I didn't say I wanted warm colors, I said it felt corporate." The mismatch cost a revision cycle (8 hours of design time, ~$1,200).

C005 HIGH OBSERVED REPEATEDLY 7x efficiency

The timeline management agent builds project schedules with mandatory creative buffer: 20% for brand identity projects, 15% for campaign projects, 10% for production work. The buffer is visible to the internal team but not to the client.

Why: Creative work is inherently unpredictable. A logo that clicks on the first round of concepts is done in 3 days. A logo that requires 5 rounds of exploration takes 3 weeks. Without buffer, the team is always behind.

Failure mode: Before mandatory buffer, the timeline agent scheduled a brand identity project based on "best case" estimates. The project hit 3 revision rounds (normal). By round 2, the timeline was blown. Diego had to call the client and push the deadline 10 days. The client's product launch was affected. Prism's next 2 proposals from that client were declined.
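
The buffer rule as arithmetic; a sketch using the percentages above (the client sees only the padded date, while the team tracks the raw estimate internally):

```python
# Mandatory creative buffer by project type (from the rule above).
BUFFER = {"brand_identity": 0.20, "campaign": 0.15, "production": 0.10}

def schedule_days(project_type: str, estimate_days: float) -> float:
    # The padded figure goes on the client-facing schedule; the team
    # knows which portion is buffer.
    return estimate_days * (1 + BUFFER[project_type])

# e.g. schedule_days("brand_identity", 20) -> 24.0
```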

C006 HIGH OBSERVED ONCE 5x efficiency

The proposal writing agent drafts proposals in Mara's voice based on a voice library of 40+ approved proposals. The agent handles structure, pricing logic, and scope definition. Mara reviews and personalizes every proposal before it goes to a client.

Why: Proposals are sales documents. Mara's voice -- direct, confident, specific to the client's problem -- is what converts. An agent that produces generic proposals loses the personal touch that wins boutique clients.

Failure mode: An early proposal draft used generic agency language ("We'll leverage our deep expertise in brand strategy to deliver measurable outcomes"). Mara's actual voice: "Here's what we'll do for you, and here's exactly what it will cost." The generic version lost a pitch. Mara rewrote it and won the re-pitch 3 weeks later, but the delay almost cost the $28K project.

R3V Founding gold
C003 HIGH HUMAN DEFINED RULE 5x efficiency

Orchestrators own routing, delegation, and controlled execution; workers own narrow task completion; reviewers or review-like components own consolidation or quality checks; clockwork agents own recurring maintenance.

Why: Authority is aligned to agent class. The current stack shows orchestrators such as Sage and the Org Master, workers such as Lens, Scribe, Seeder, and specialists, and clockwork-style recurring maintenance patterns through Knowledge Connector and scheduled routines.

Failure mode: If worker agents gain orchestration authority or orchestrators are given ambiguous execution rights, governance becomes inconsistent and incident root cause becomes harder to trace.

C008 HIGH HUMAN DEFINED RULE 5x efficiency

Read-only question answering about CRM and service data is treated as its own governed capability, separate from operational automation.

Why: The GHL Agent was created specifically as a read-only worker for broad questions across contacts, opportunities, conversations, appointments, and related CRM records.

Failure mode: If analytical read access and operational authority are mixed, users may assume a data-answering agent can also take action, increasing accidental writes and false expectations.

C003 HIGH OBSERVED ONCE 5x efficiency

The campaign audit agent produces findings. The media buyer produces action plans. Never the reverse.

Why: The audit agent's monthly report included "Recommendation: pause broad match keywords, switch to phrase match." The media buyer disagreed -- broad match was performing because of Smart Bidding context signals the agent couldn't see in the API data. The founder sided with the agent's recommendation because "the AI analyzed it." The media buyer implemented the change reluctantly. CPL increased 40% in the first week. The media buyer reverted and was frustrated for a month.

Failure mode: Agent recommendations override human specialist judgment. Practitioners lose authority. Morale drops. Performance suffers.

C004 HIGH OBSERVED ONCE 5x efficiency

The inbox assistant categorizes emails and drafts responses. It does not promise timelines, deliverables, or budget changes.

Why: A dentist client emailed asking "can you increase my budget by $500 next month?" The inbox assistant drafted: "Absolutely, we'll get that adjusted for you starting the 1st." The founder approved the draft without reading carefully (morning rush, 14 emails to review). The $500 increase was implemented, but the dentist's credit card on file had a $5K limit and the increased spend triggered a card decline 3 weeks in, pausing the campaign for 2 days.

Failure mode: Agent drafts commit to operational changes. Hurried human approval lets commitments through. Downstream billing or operational conflicts surface days or weeks later.

Sneeze It Founding gold
C005 HIGH OBSERVED REPEATEDLY 7x efficiency

The Reporting Agent owns weekly performance summaries. The Spend Monitor owns daily pacing alerts. They never overlap. The Reporting Agent does not alert on daily spend. The Spend Monitor does not summarize weekly trends.

Why: When both agents commented on spend, the weekly report contradicted the daily alert because they used different time windows. Strict lane separation fixed it within one day.

Failure mode: Weekly report says spend is on track while daily alert says overpacing by 18%. Both are correct for their own time windows, but the client sees both and panics.

C006 MEDIUM HUMAN DEFINED RULE 3x efficiency

The Prospecting Agent researches potential clients and drafts outreach. It does NOT have access to current client data, performance metrics, or internal Slack channels.

Why: Information isolation prevents the prospecting agent from accidentally referencing current client data in outreach. It also prevents scope creep into account management territory.

Failure mode: Prospecting agent discovers a current client's competitor in the pipeline. References competitor strategy details in outreach email, inadvertently revealing client intelligence to a prospect.

C007 MEDIUM OBSERVED ONCE 3x efficiency

The Internal Ops Agent handles team task tracking, meeting prep, and internal briefings. It is the only agent that reads the project management tool. Other agents request project status through its state file.

Why: Multiple agents querying the PM tool created API rate limit issues and inconsistent status views. Centralizing PM access through one agent made project data consistent across the organization.

Failure mode: Three agents query Asana simultaneously. Rate limit hit. Two get stale cached data, one gets current. Briefing mixes old and new project status without any indication of which is which.
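
A sketch of the single-reader pattern, assuming a JSON state file the Internal Ops Agent refreshes on its own schedule (the path and field names are hypothetical):

```python
import json
import time
from pathlib import Path

STATE_FILE = Path("state/project_status.json")   # hypothetical location
MAX_AGE_SECONDS = 300

def read_project_status() -> dict:
    """What every other agent calls instead of querying the PM tool.

    Only the Internal Ops Agent writes this file, so all readers see the
    same snapshot -- no rate limits, no mixed stale-and-fresh views.
    """
    snapshot = json.loads(STATE_FILE.read_text())
    snapshot["stale"] = time.time() - snapshot["fetched_at"] > MAX_AGE_SECONDS
    return snapshot
```

Staleness is surfaced on every read, which addresses the failure mode directly: a briefing built from this snapshot knows, and can say, how old its data is.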

C008 HIGH OBSERVED REPEATEDLY 7x efficiency

Each agent has a written one-line role, a list of what it owns, and an explicit list of what it does NOT own. Authority boundaries are documented, not implied.

Why: Without explicit boundaries, agents drift into overlapping responsibilities. Implicit ownership creates scope conflicts.

Failure mode: Two agents both track project status. Conflicting updates confuse the team. Neither agent knows the other is updating.
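
Documented boundaries can also be machine-checked. A sketch, assuming role statements live in a registry (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoleSpec:
    name: str
    role: str                      # the written one-line role
    owns: frozenset[str]
    does_not_own: frozenset[str]   # explicit, not implied

def assert_no_overlap(specs: list[RoleSpec]) -> None:
    """Reject a registry in which two agents own the same responsibility."""
    owner: dict[str, str] = {}
    for spec in specs:
        for responsibility in spec.owns:
            if responsibility in owner:
                raise ValueError(
                    f"'{responsibility}' is owned by both "
                    f"{owner[responsibility]} and {spec.name}")
            owner[responsibility] = spec.name
```

Run at deploy time, the overlap check catches two agents claiming project-status tracking before either one ships.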

C009 HIGH MEASURED RESULT 10x efficiency

The Call Center Manager agent manages 3 human employees through daily Slack messages. It reads performance data, drafts coaching messages in the founder's voice, and sends via the founder's Slack account after approval. The humans do not know it is AI.

Why: Data-driven daily coaching at the individual level was not possible with a human manager at this team size. The AI manager processes call stats, identifies patterns, and delivers specific, numbered feedback daily. After 6 days of AI management, the former human manager was transitioned to a caller position based on performance data showing the AI produced more consistent, data-backed coaching.

Failure mode: If coaching messages sound generic or robotic, human employees disengage. Messages must be varied, specific, human-sounding, and data-backed. Formulaic messages degrade performance within 3 days.

C010 MEDIUM OBSERVED REPEATEDLY 4x efficiency

The Evaluator agent scores system maturity against a published 8-level framework. It identifies the single highest-impact bottleneck and hands it to the Learning agent. The Learning agent implements. The Evaluator re-scores. This loop runs without the founder in the middle.

Why: Self-improvement requires both diagnosis and action. Separating evaluation from implementation prevents self-grading bias. The loop IS a demonstration of the maturity it measures.

Failure mode: Evaluator diagnoses correctly but the implementer fails to execute. Score stagnates. Or: implementer makes changes the evaluator hasn't requested, creating drift.
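
A sketch of the loop under the stated separation, with duck-typed Evaluator and Learning interfaces (the method names are assumptions). The stagnation check is where the failure mode above surfaces:

```python
def improvement_loop(evaluator, learner, max_rounds: int = 5) -> float:
    """Diagnose -> implement -> re-score, without the founder in the middle."""
    score = evaluator.score()
    for _ in range(max_rounds):
        bottleneck = evaluator.top_bottleneck()   # Evaluator only diagnoses
        learner.implement(bottleneck)             # Learner only implements
        new_score = evaluator.score()             # Evaluator re-scores
        if new_score <= score:
            # Stagnation: correct diagnosis but failed execution, or drift.
            # Escalate rather than keep looping on a broken implementer.
            break
        score = new_score
    return score
```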

Stackwise silver
C005 MEDIUM MEASURED RESULT 6x efficiency

The Support Agent owns issue resolution. The Onboarding Agent owns the first 30 days. During onboarding, support tickets route to Onboarding, not Support.

Why: New customers who hit issues in the first 30 days need different urgency than established customers. Support's standard 4-hour response time is too slow for someone deciding whether to stay.

Failure mode: Day 3 customer submits ticket. Support responds in 4 hours with standard template. Customer expected white-glove onboarding. Churns before day 7.
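
The routing rule itself is small. A minimal sketch, assuming the ticket carries the customer's start date:

```python
from datetime import date, timedelta

ONBOARDING_WINDOW = timedelta(days=30)

def route_ticket(customer_since: date, today: date) -> str:
    """First 30 days go to Onboarding; everything after goes to Support."""
    if today - customer_since <= ONBOARDING_WINDOW:
        return "onboarding"   # white-glove urgency for new customers
    return "support"          # standard 4-hour response SLA
```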

C006 HIGH MEASURED RESULT 10x efficiency

Engineering Alerts agent monitors production and pages on-call. It reports symptoms only: what broke, when, how many affected. It does not diagnose or suggest fixes.

Why: Early version included "probable cause" in pages. Wrong 60% of the time. Engineers spent first 20 minutes chasing the agent's incorrect diagnosis instead of investigating actual symptoms.

Failure mode: Agent pages: "Database pool exhausted, probable cause: recent migration." Engineer rolls back migration. Actual cause: leaked connection in new feature. Rollback useless. 40 minutes wasted.
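
One way to make "symptoms only" structural rather than behavioral is to give the page payload no field a diagnosis could occupy. A minimal sketch (field names are illustrative):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class SymptomPage:
    """What broke, when, how many affected -- and nothing else.

    There is no probable_cause field, so a diagnosis cannot leak into
    pages even if the model volunteers one.
    """
    what_broke: str        # e.g. "checkout API returning 500s"
    started_at: datetime
    affected_count: int

    def render(self) -> str:
        return (f"{self.what_broke} since {self.started_at:%H:%M} UTC; "
                f"~{self.affected_count} affected")
```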

C007 MEDIUM OBSERVED ONCE 3x efficiency

Content Agent drafts posts, changelogs, and social. It does not publish directly. Founder reviews for voice, engineering reviews for technical accuracy.

Why: Content agent claimed a feature "uses machine learning" when it was rule-based. Technical user called it out on Hacker News. Embarrassing correction needed.

Failure mode: Agent overstates technical capabilities. Published without technical review. Community catches it. Credibility damaged.

C005 HIGH OBSERVED ONCE 5x efficiency

The appointment agent owns reminders and no-show prediction. The education agent owns patient handouts and condition-specific content. The onboarding agent owns staff training docs. No agent touches another agent's domain.

Why: In week 4, the education agent started generating "appointment preparation" content that overlapped with the appointment agent's pre-visit reminders. Patients received two messages for the same visit with slightly different instructions (one said "fast for 12 hours," the other said "fast for 8 hours"). Front desk fielded 6 confused calls in one morning.

Failure mode: Two agents send conflicting pre-visit instructions. Patient follows the wrong one. Lab results are invalid. Patient has to return for a redraw, losing a half-day of work. Patient leaves a 1-star review mentioning "they can't even get their own instructions straight."

C006 MEDIUM OBSERVED ONCE 3x efficiency

The onboarding agent generates training documentation only. It does not schedule training sessions, assign mentors, or modify the HR system. Those actions require Tanya.

Why: Scope creep is real. The onboarding agent started drafting "suggested training schedules" that staff interpreted as actual assignments. New MA showed up for a shadowing session that had never been confirmed with the supervising physician. Physician was mid-procedure.

Failure mode: Agent suggests a training schedule. New hire treats it as confirmed. Shows up to shadow Dr. Okafor during a procedure that requires focused attention. Disruption, awkwardness, and Dr. Okafor loses 15 minutes explaining the situation.

C007 HIGH OBSERVED ONCE 5x efficiency

The education agent uses only physician-approved medical references (UpToDate, CDC guidelines, ADA standards) as source material. It does not synthesize from general web content or training data alone.

Why: Medical accuracy is non-negotiable. The agent generated a diabetes management handout that included a dietary recommendation sourced from its training data rather than current ADA guidelines. The recommendation had been updated 8 months prior. Dr. Pham spent 20 minutes correcting it.

Failure mode: Agent generates content based on outdated training data. Handout recommends a medication dosing schedule that was revised 6 months ago. Patient follows outdated guidance. Best case: ineffective treatment. Worst case: adverse event.

C005 HIGH OBSERVED ONCE 5x efficiency

Customer onboarding agent configures new customer environments (API keys, rate limits, model access) but cannot modify existing customer configurations. Changes to live customers require a human engineer.

Why: Misconfigured rate limits or model access on a live customer can either expose them to overcharges or lock them out of their own eval pipeline.

Failure mode: Onboarding agent was given modify access to "fix" a new customer's rate limit. It updated the wrong customer's config (adjacent row in the customer table). The existing customer's rate limit dropped from 10K/min to 100/min. Their eval pipeline queued 9,900 requests and started timing out. Their monitoring system fired alerts to their SRE team at 2 AM. They filed a P1 incident against Synthwave. Resolution took 4 hours. The customer demanded a 1-month service credit ($4,200).
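
A sketch of the corresponding write guard, assuming a customer record with a status field (the DB interface is hypothetical). Insert-only semantics mean an adjacent-row mistake fails loudly instead of silently reconfiguring a live customer:

```python
def apply_onboarding_config(db, customer_id: str, config: dict) -> None:
    """The onboarding agent's only write path: new customers, insert-only."""
    customer = db.get_customer(customer_id)
    if customer is None or customer["status"] != "onboarding":
        # Live or unknown customer: the agent has no write path at all.
        raise PermissionError(
            f"{customer_id} is not in onboarding; route to a human engineer")
    db.insert_config(customer_id, config)   # INSERT, never UPDATE
```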

C006 HIGH OBSERVED ONCE 5x efficiency

Eval pipeline monitoring agent watches system health metrics (latency, error rates, queue depth) but never triggers automated remediation. It alerts the on-call engineer via PagerDuty.

Why: Automated remediation in an AI eval pipeline can mask deeper issues. A restart that clears a queue also destroys in-flight eval results that customers are waiting on.

Failure mode: Before the no-remediation rule, the monitoring agent auto-restarted a stalled eval worker. The restart cleared 340 in-flight eval jobs. 12 customers saw "eval failed" errors for jobs they'd submitted. 4 customers re-submitted, creating duplicate results. 2 customers used the duplicate results in production decisions before realizing the duplication. The engineering team spent 3 days deduplicating and verifying results across all affected customers.

C007 HIGH OBSERVED ONCE 5x efficiency

Incident response agent drafts customer-facing status page updates and direct communications, but all messaging requires approval from both the CTO (Rohan) and the CEO (Lina) before publication.

Why: Incident communications set legal and contractual precedents. Saying "data was not compromised" when it was creates liability. Saying "data may have been compromised" when it wasn't creates unnecessary panic.

Failure mode: Agent drafted a status update that said "no customer data was affected" during the performance review pipeline incident. In reality, internal performance data was exposed to 3 customers -- not "customer data" in the traditional sense, but the customers disagreed with that characterization. If the statement had been published, it would have been contradicted by the customers' own screenshots. Lina rewrote the update to acknowledge that "internal operational data was briefly visible in a shared staging environment."

C008 MEDIUM OBSERVED ONCE 3x efficiency

Sales demo prep agent builds demo environments with synthetic data only. It never uses real customer data, even anonymized, in sales demonstrations.

Why: "Anonymized" data has been de-anonymized by prospects in at least one competitor's sales demo. Enterprise security teams look for patterns in demo data to identify existing customers.

Failure mode: Demo agent used anonymized customer eval data in a demo. A prospect's security engineer recognized the model architecture patterns and said: "This looks like [specific customer]'s eval setup." The salesperson denied it, but the prospect told the customer. The customer called Rohan within 2 hours. The relationship survived only because the data was genuinely anonymized and the customer's security team verified it. But the 3-hour investigation cost was real.

C009 MEDIUM OBSERVED ONCE 3x efficiency

Competitor analysis agent scrapes only public information (pricing pages, documentation, blog posts, changelog). It never accesses competitor products using credentials, free trials, or any method that could be construed as unauthorized access.

Why: Synthwave's customers include companies that also sell AI tools. Unauthorized access to a competitor could become a breach of trust with mutual customers.

Failure mode: No direct failure, but the competitor analysis agent once signed up for a competitor's free trial using a disposable email. The competitor's sales team traced the signup to Synthwave's IP range and called Rohan: "Your team is reverse-engineering our product." Rohan hadn't known about the signup. The call was awkward but the competitor accepted the explanation. Free trial access was permanently revoked from the agent's capabilities.

C005 HIGH OBSERVED ONCE 5x efficiency

Vigil (ad monitoring) must flag any single-day spend that exceeds the daily budget by more than 15%, any 3-day rolling average that exceeds the weekly budget pace by more than 10%, or any campaign with CPA rising more than 25% week-over-week. Flags go to #ad-alerts in Slack with specific numbers, not vague warnings.

Why: Small brands can't absorb runaway ad spend. A Meta campaign with a broken audience that runs over the weekend can burn $2,000 before anyone notices Monday morning. Vigil is the smoke detector.

Failure mode: A Google Shopping campaign's daily spend jumped from $85 to $340 on a Friday. Vigil's original threshold was 50% (too loose). Nobody checked until Monday. Total weekend overspend: $680 against a $250 weekend budget. The 15% threshold was set after this incident.
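
The three thresholds reduce to a few comparisons. A sketch, assuming "weekly budget pace" means the weekly budget divided by seven:

```python
def vigil_flags(daily_spend: float, daily_budget: float,
                avg_3day: float, weekly_budget: float,
                cpa_now: float, cpa_prior: float) -> list[str]:
    """Specific, numbered flags for #ad-alerts -- never vague warnings."""
    flags = []
    if daily_spend > daily_budget * 1.15:
        flags.append(f"Daily spend ${daily_spend:,.0f} is "
                     f"{daily_spend / daily_budget - 1:.0%} over the "
                     f"${daily_budget:,.0f} budget")
    pace = weekly_budget / 7
    if avg_3day > pace * 1.10:
        flags.append(f"3-day average ${avg_3day:,.0f}/day is "
                     f"{avg_3day / pace - 1:.0%} over weekly pace")
    if cpa_now > cpa_prior * 1.25:
        flags.append(f"CPA up {cpa_now / cpa_prior - 1:.0%} week-over-week "
                     f"(${cpa_prior:.2f} -> ${cpa_now:.2f})")
    return flags
```

On the Friday incident above, the $340 day against an $85 budget is a 300% overage -- flagged immediately at the 15% threshold instead of surfacing in a Monday check.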

C006 HIGH OBSERVED ONCE 5x efficiency

Rhythm (email campaigns) must never send to the full subscriber list without explicit human approval. Segmented sends to lists under 2,000 contacts may be queued for review. Any send over 2,000 contacts requires the founder or marketing coordinator to approve in Slack.

Why: An email to 28,000 subscribers with a typo in the subject line, wrong discount code, or broken product link is an immediate brand event. It's also irreversible. Small segmented sends limit the blast radius of any error.

Failure mode: Rhythm queued a "flash sale" email to the full list with a 40% discount code that was supposed to be limited to VIP customers (top 500). The marketing coordinator caught it 20 minutes before the scheduled send. If it had gone out, the margin impact on a 2-day flash sale at 40% off to the full list would have been approximately $14,000 in margin erosion.
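
The gate is a single routing decision. A minimal sketch:

```python
APPROVAL_THRESHOLD = 2_000

def route_send(recipient_count: int, is_full_list: bool) -> str:
    """Segmented sends under 2,000 contacts queue for review; any full-list
    send, or anything larger, holds for explicit human approval in Slack."""
    if is_full_list or recipient_count > APPROVAL_THRESHOLD:
        return "hold_for_slack_approval"   # founder or marketing coordinator
    return "queue_for_review"
```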

C007 MEDIUM OBSERVED ONCE 3x efficiency

Shade (competitor pricing) must report price changes as facts, not recommendations. Shade may note that Competitor X dropped their heavyweight tee from $48 to $38. Shade must NOT recommend that Threadline match the price. Pricing decisions are founder-only.

Why: Race-to-the-bottom pricing destroys DTC margins. An agent that sees a competitor price drop and recommends a match is optimizing for competitiveness at the expense of profitability. The founder understands brand positioning, customer willingness to pay, and margin floors in ways the agent cannot.

Failure mode: Early in deployment, Shade recommended matching a competitor's 30% price cut on joggers. The founder almost followed the recommendation before realizing the competitor was liquidating the style (discontinuing it). Matching would have eroded $6/unit margin on a strong-performing SKU for no competitive reason.

C008 HIGH OBSERVED ONCE 5x efficiency

Rebound (returns) must verify order eligibility (within return window, item category eligible, purchase channel eligible) before drafting a return authorization. If any eligibility check fails, Rebound must explain why and offer to escalate to the CS rep -- not override the policy.

Why: Consistent policy enforcement builds trust. Customers who see inconsistent return decisions assume the system is arbitrary and push harder. Rebound must be the most consistent enforcer in the system.

Failure mode: Rebound approved a return for a final-sale item because the customer's tone was upset and Rebound's empathy heuristic overrode the eligibility check. The item was a limited-edition collaboration piece bought at 50% off. The return cost the business $34 in margin plus return shipping. More importantly, it set a precedent that other customers referenced ("but my friend got a return on a final sale item").
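
A sketch of the eligibility gate. The important structural choice is that customer sentiment is not an input, so an empathy heuristic has nothing to override (the window length and category names are illustrative assumptions):

```python
from datetime import date

RETURN_WINDOW_DAYS = 30                                    # assumed window
INELIGIBLE_CATEGORIES = {"final-sale", "limited-edition"}  # illustrative

def check_return(order_date: date, category: str, channel: str,
                 eligible_channels: frozenset[str], today: date) -> dict:
    reasons = []
    if (today - order_date).days > RETURN_WINDOW_DAYS:
        reasons.append("outside return window")
    if category in INELIGIBLE_CATEGORIES:
        reasons.append(f"'{category}' items are not return-eligible")
    if channel not in eligible_channels:
        reasons.append(f"'{channel}' purchases are not return-eligible")
    if reasons:
        # Explain and offer escalation to the CS rep -- never override.
        return {"authorized": False, "reasons": reasons, "escalate": True}
    return {"authorized": True, "reasons": [], "escalate": False}
```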

C004 HIGH OBSERVED REPEATEDLY 7x efficiency

The deal memo agent produces analysis documents only. It presents market data, comparable transactions, risk factors, and financial projections with assumptions clearly stated. It never uses evaluative language ("good deal," "attractive returns," "strong upside").

Why: Deal memos are distributed to accredited investors as part of the offering process. Evaluative language transforms an informational document into a solicitation, which has different regulatory requirements.

Failure mode: An early deal memo draft described a property as having "exceptional upside potential." Tomasz caught it during associate review. "Exceptional" implies a quality judgment about the investment. Replaced with "Projected returns under base case assumptions are detailed in Exhibit C."

C005 HIGH OBSERVED ONCE 5x efficiency

The compliance document agent generates templates and first drafts only. Every document (PPMs, subscription agreements, side letters) requires Chen's line-by-line review and outside counsel signoff for any deal-specific terms.

Why: Securities documents with errors can invalidate an offering. A subscription agreement with the wrong minimum investment amount or missing accredited investor representations creates legal exposure for every investor in the deal.

Failure mode: The compliance agent generated a subscription agreement draft using terms from Deal A applied to Deal B. The minimum investment was listed as $50,000 (Deal A's minimum) instead of $100,000 (Deal B's minimum). Chen caught it, but noted it was "the kind of error that could let an unqualified investor into a deal if missed."

C006 HIGH OBSERVED ONCE 5x efficiency

The market research agent produces research briefs labeled "FOR INTERNAL USE ONLY" by default. Any research intended for investor distribution requires the 3-step review chain (C001) and additional disclaimers drafted by Chen.

Why: Market research shared with investors becomes part of the offering materials. Statements about market conditions ("We see strong demand for multifamily in the Southeast") can be construed as forward-looking projections or general solicitation.

Failure mode: Sarah forwarded an internal market research brief to an LP who had asked about the Southeast multifamily market. The brief was not investor-grade -- it contained language like "We're bullish on this sector" and had no disclaimers. Chen was not involved. The LP's family office attorney flagged the language and requested Upside's compliance policies. No regulatory action, but Chen implemented the internal-only default label immediately.

Vetted Goods silver
C005 HIGH OBSERVED REPEATEDLY 7x efficiency

Chorus (content) must load the brand-specific voice guide before every generation task. Voice guides are stored per brand: `/voice/ridgeline-v4.md`, `/voice/forma-v2.md`, `/voice/copper-v3.md`. Chorus must confirm which voice guide is loaded in its output header.

Why: Voice bleed is the most common multi-brand failure. It's subtle -- a single adjective that feels "off brand" is easy to miss in review. The voice guide load confirmation forces both the agent and the reviewer to verify brand alignment.

Failure mode: Beyond the adventure-tee incident (C001), Chorus generated Forma Daily Instagram captions using Ridgeline's voice guide 3 separate times over 2 months. Each time, the captions were "fine" but didn't sound like Forma. The marketing coordinator approved them because the content wasn't wrong, just slightly off. Customer engagement on those posts was 30-40% below Forma's average. The correlation wasn't identified until a quarterly content audit.
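
A sketch of the load-and-confirm step, using the voice guide paths from the rule (the header format is an assumption):

```python
from pathlib import Path

VOICE_GUIDES = {
    "ridgeline": Path("/voice/ridgeline-v4.md"),
    "forma": Path("/voice/forma-v2.md"),
    "copper": Path("/voice/copper-v3.md"),
}

def load_voice_guide(brand: str) -> tuple[str, str]:
    """Returns the guide text plus the header line Chorus must emit.

    A KeyError on an unknown brand is the desired behavior: failing
    loudly beats silently generating in the wrong brand's voice.
    """
    path = VOICE_GUIDES[brand]
    header = f"VOICE GUIDE LOADED: {path.name}"   # first line reviewers check
    return path.read_text(), header
```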

C006 HIGH OBSERVED ONCE 5x efficiency

Harbor (CS) must respond using the specific brand's CS template set. Ridgeline's CS voice is friendly/outdoorsy ("Happy to help! Let's get this sorted."). Forma's is minimal/efficient ("Here's what we can do."). Copper & Thread's is warm/premium ("We want to make this right for you."). Harbor must never use one brand's tone for another.

Why: CS interactions are the highest-touch brand experience. A Copper & Thread customer paying $180 for a leather wallet expects a premium service experience, not a casual "no worries, dude" response borrowed from Ridgeline's template.

Failure mode: Harbor responded to a Copper & Thread complaint about a defective zipper with Ridgeline's casual tone: "Bummer about the zipper! We'll get you a new one ASAP." The customer replied: "I paid $180 for this bag. 'Bummer' is not the response I expected." The customer posted the exchange on Instagram (1,200 views). The ops manager rewrote the response and personally followed up.

C007 HIGH OBSERVED ONCE 5x efficiency

Atlas (inventory) must maintain separate demand models per brand and per sales channel (Shopify direct, Amazon, wholesale). Cross-brand inventory sharing (Brand A slow-mover restocked as Brand B product) requires founder approval.

Why: Each brand's demand patterns are driven by different factors. Ridgeline is seasonal (Q4 heavy, summer slow). Forma is steady year-round. Copper & Thread spikes around holidays. A model that averages across brands produces forecasts that are wrong for all three.

Failure mode: Atlas used a pooled demand model during its first month. The model predicted steady demand for Ridgeline in July (because Forma's steady demand pulled the average up). Ridgeline actually dropped 35% in summer. The team ordered based on Atlas's projection and ended up with $22,000 in excess Ridgeline summer inventory that had to be marked down 40% in August.

C008 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Prism (competitive intel) must maintain separate competitor sets per brand. Ridgeline competes with Patagonia, REI, Cotopaxi. Forma competes with Everlane, Uniqlo, COS. Copper & Thread competes with Bellroy, Shinola, Ghurka. Prism must never surface Ridgeline competitor data in a Forma context.

Why: Competitive intelligence only has value when it's contextually relevant. A price drop from Everlane is critical for Forma's pricing strategy and completely irrelevant to Ridgeline. Cross-brand competitor noise makes the team ignore the signal.

Failure mode: Prism reported Cotopaxi's new product launch in a Forma competitive update. The Forma marketing coordinator spent 45 minutes evaluating whether to respond before realizing Cotopaxi is outdoor wear, not minimalist basics. The error wasn't damaging, but it eroded trust in Prism's relevance. The marketing coordinator started skipping Prism reports entirely for 3 weeks.
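
The competitor sets map directly to a per-brand filter. A minimal sketch; applied before any intel item reaches a brand channel, it would have dropped the Cotopaxi item from the Forma update:

```python
COMPETITOR_SETS = {
    "ridgeline":       {"Patagonia", "REI", "Cotopaxi"},
    "forma":           {"Everlane", "Uniqlo", "COS"},
    "copper & thread": {"Bellroy", "Shinola", "Ghurka"},
}

def relevant_to_brand(brand: str, competitor: str) -> bool:
    """Gate every intel item by the receiving brand's competitor set."""
    return competitor in COMPETITOR_SETS[brand]
```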