Coordination Intelligence

Failure Patterns

96 claims from 32 organizations

Documented things that go wrong and how to prevent them. Failure pattern claims are among the most valuable in any OOS because they encode lessons learned the hard way. Other organizations can learn from these without experiencing the failures themselves.

Acme Digital Agency Founding gold
C011 HIGH OBSERVED ONCE 5x efficiency

Gave the analytics agent write access to campaigns. It optimized for the wrong metrics.

Why: The agent lacked client context.

Failure mode: Decreased spend on a strategic brand campaign.

C012 HIGH MEASURED RESULT 10x efficiency

A single shared state file became a bottleneck and a source of corruption.

Why: Concurrent writes caused data races.

Failure mode: Two agents update simultaneously. One update lost.

C010 MEDIUM OBSERVED REPEATEDLY 4x efficiency

When GPT generates content that fails fact-checking, log the failure type (fabricated claim, wrong client data, prohibited language, tone mismatch) and review monthly for patterns.

Why: After 3 months of logging, we found that 62% of GPT fact-check failures were fabricated social proof -- testimonials, case study numbers, and "as seen in" claims that didn't exist. Armed with this pattern, we added a pre-generation instruction to GPT: "Do not generate testimonials, case study results, or media mentions unless they appear verbatim in the client fact sheet." Fabricated social proof failures dropped 84% the next month.

Failure mode: Without categorized failure logging, the same error types recur. Generic "be more accurate" prompting doesn't target the specific failure mode.
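
A minimal sketch of the categorized failure log this claim describes, assuming a CSV file and the four category names from the claim; the file path and field layout are illustrative, not the agency's actual schema.

```python
import csv
from collections import Counter
from datetime import date
from pathlib import Path

# Categories mirror the four named in the claim.
CATEGORIES = {"fabricated_claim", "wrong_client_data", "prohibited_language", "tone_mismatch"}

LOG_PATH = Path("factcheck_failures.csv")  # hypothetical location

def log_failure(category: str, client: str, note: str) -> None:
    """Append one fact-check failure with its category for later pattern review."""
    if category not in CATEGORIES:
        raise ValueError(f"Unknown failure category: {category}")
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "category", "client", "note"])
        writer.writerow([date.today().isoformat(), category, client, note])

def monthly_review() -> Counter:
    """Tally failures by category so the dominant failure mode becomes visible."""
    with LOG_PATH.open() as f:
        rows = list(csv.DictReader(f))
    return Counter(row["category"] for row in rows)
```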

C011 MEDIUM OBSERVED ONCE 3x efficiency

When a cross-model handoff fails, the receiving model must reject the input and report the schema violation. It must never improvise with missing fields.

Why: The creative brief schema requires a "tone" field (professional, casual, urgent, educational). When a brief arrived without the tone field due to a schema version mismatch, GPT defaulted to "casual" -- its training default. The client was a law firm. The generated ad copy opened with "Hey there! Need a lawyer?" The account manager caught it, but the failure revealed that missing fields trigger model defaults rather than errors.

Failure mode: Missing schema fields are silently filled by model defaults. Defaults reflect training distribution, not client requirements. Casual tone is GPT's most common training context.
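
A minimal sketch of the receiving side of the handoff, assuming a dict-shaped brief; the required-field set beyond "tone" and the exception name are illustrative. The point is that a missing field raises an error instead of falling through to a model default.

```python
ALLOWED_TONES = {"professional", "casual", "urgent", "educational"}
REQUIRED_FIELDS = {"client", "objective", "tone"}  # "tone" is from the claim; the others are assumed

class SchemaViolation(Exception):
    """Raised so the handoff fails loudly instead of falling back to model defaults."""

def validate_brief(brief: dict) -> dict:
    missing = REQUIRED_FIELDS - brief.keys()
    if missing:
        # Reject and report -- never let the receiving model improvise the value.
        raise SchemaViolation(f"Creative brief rejected; missing fields: {sorted(missing)}")
    if brief["tone"] not in ALLOWED_TONES:
        raise SchemaViolation(f"Invalid tone: {brief['tone']!r}")
    return brief
```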

C012 MEDIUM OBSERVED ONCE 3x efficiency

API failures on one platform (Meta or Google) must not block reporting on the other platform. Each platform's monitoring runs independently.

Why: An early architecture decision chained Meta and Google monitoring sequentially. When Meta's API went down for 4 hours on a Tuesday morning, Google Ads monitoring was also blocked because it waited for Meta to complete. We missed a Google Ads account that had exhausted its daily budget by 9 AM due to a bidding error. Cost: $1,100 in wasted spend before the media buyer checked manually at noon.

Failure mode: Sequential dependencies between independent data sources create cascading failures. One platform's outage blinds monitoring on unrelated platforms.
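
One way to express the fix, sketched with Python's standard thread pool; the function names and the 300-second timeout are assumptions. Each platform check runs and fails on its own, so an outage is recorded as "unavailable" instead of blocking the other platform.

```python
from concurrent.futures import ThreadPoolExecutor

def check_meta() -> dict:
    ...  # hypothetical Meta monitoring call

def check_google() -> dict:
    ...  # hypothetical Google Ads monitoring call

def run_monitoring() -> dict:
    """Run each platform check independently; one outage never blocks the other."""
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = {"meta": pool.submit(check_meta), "google": pool.submit(check_google)}
        for platform, future in futures.items():
            try:
                results[platform] = {"status": "ok", "data": future.result(timeout=300)}
            except Exception as exc:
                # Record the outage explicitly instead of letting it cascade.
                results[platform] = {"status": "unavailable", "error": str(exc)}
    return results
```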

C013 HIGH OBSERVED REPEATEDLY 7x efficiency

When an agent error impacts a franchisee, the corporate team is notified within 1 hour and the franchisee receives a personal call within 4 hours. Agent errors are not communicated via email or automated message.

Why: Franchisees pay franchise fees. An impersonal response to an agent-caused error signals that corporate doesn't take the relationship seriously. Two franchisees cited "lack of responsiveness to marketing errors" as a factor in non-renewal discussions.

Failure mode:

C014 MEDIUM OBSERVED ONCE 3x efficiency

Any agent that produces a cross-location data leak (member PII visible outside its home location) triggers an immediate 24-hour audit of all cross-location reports produced in the prior 30 days.

Why: A single leak may indicate a systemic template error. Catching it early prevents regulatory exposure.

Failure mode: The C003 incident revealed that 3 other report templates had similar location-name-in-header issues. The 24-hour audit caught them before they were distributed.

C015 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Campaign launch failures (wrong creative, wrong audience, wrong location) require a root cause analysis within 48 hours. The analysis must identify whether the failure was data (wrong input), logic (wrong rule), or coordination (right data, wrong handoff).

Why: Without categorizing failures, fixes address symptoms. The C001 promo conflict was initially blamed on "bad creative" when the root cause was a missing coordination protocol between agents.

Failure mode:

C012 HIGH OBSERVED ONCE 5x efficiency

If an agent error touches a client relationship, Marcus personally reaches out within 24 hours. The agent does not attempt to correct its own mistake in client-facing contexts.

Why: Automated error correction looks worse than the original error. A human apology preserves trust.

Failure mode: Timeline agent sent a project update with the wrong delivery date (confused two projects with similar names). Before Marcus could intervene, the agent sent a correction email. Client replied: "How many robots are running this?" Marcus lost 2 hours on damage control.

C013 HIGH OBSERVED REPEATEDLY 7x efficiency

Agent errors involving incorrect client data (wrong name, wrong project, wrong dates) trigger an immediate audit of the data source, not just a correction of the output.

Why: Creative agencies juggle 8-12 active projects. Data cross-contamination between projects is the most dangerous failure mode.

Failure mode: Intake agent pulled revision notes from Project A into the brief for Project B because both clients had the same first name. The shot list was built on contaminated requirements. 4 hours of work scrapped.

C014 MEDIUM OBSERVED ONCE 3x efficiency

Never auto-archive or auto-close a project. Only Marcus marks projects complete.

Why: Creative projects have long tails. A "delivered" video might come back for re-edits 3 months later.

Failure mode: Timeline agent auto-archived a project 30 days after final delivery. Client came back for a re-edit. All the organized revision history and shot notes were in the archive. Took 45 minutes to restore and re-orient.

Atticus Legal bronze
C013 HIGH OBSERVED ONCE 5x efficiency

The template modification incident (C001) was caused by the agent having write access to the template folder. The fix was simple: move templates to a locked folder with read-only permissions. The 22-hour cleanup was entirely preventable with proper access controls.

Why: The agent was trying to be helpful. It identified what it thought was an error and fixed it. In any other context, that initiative might be valued. In legal document assembly, unsupervised initiative is dangerous. Access controls are the only reliable safeguard against well-intentioned AI modifications.

Failure mode: Without folder-level access controls, any agent with file access can modify templates. The next modification might not be caught for months if it affects a rarely-used template (like the irrevocable life insurance trust). By then, dozens of documents could be affected.

C014 LOW OBSERVED REPEATEDLY 2x efficiency

Priya initially reviewed assembled documents by reading them end-to-end. This took 45 minutes per package and she still missed the template modification for 3 clients. The diff-check (C002) now catches structural changes automatically, and Priya focuses her 45-minute review on legal accuracy rather than template fidelity.

Why: Humans are poor at detecting subtle changes in dense legal text. Priya read the modified survivorship clause three times across three different trusts and did not notice because the change was plausible-sounding legal language. The agent did not make a typo. It made a legally coherent but incorrect modification.

Failure mode: Attorney reviews documents for obvious errors (misspellings, wrong names) but misses subtle legal modifications. Modified clause sounds correct to a quick read. Only surfaces during trust administration years later when the legal effect differs from the grantor's intent.

C015 LOW MEASURED RESULT 3x efficiency

The cost of the template incident was not just the 22 hours. One of the three affected clients moved to a different attorney. That client was worth approximately $4,800 in lifetime value (annual reviews plus referrals). Total cost: $6,600 in non-billable time plus $4,800 in lost client value. $11,400 from a single agent error.

Why: In a solo practice generating $190K/year, $11,400 is 6% of annual revenue. The entire agent implementation was projected to save $42K/year (replacing the need for a second paralegal). One error consumed 27% of the first year's projected savings.

Failure mode: Cascading cost of a single template error in a solo practice: direct remediation cost + client churn + referral loss + reputation damage in a small legal community. The financial impact is disproportionate to the size of the error.

C014 HIGH OBSERVED ONCE 5x efficiency

Any agent error involving student identity (wrong name, wrong data, wrong family) triggers a full audit of all recent outputs before any new communications are sent.

Why: The Jayden incident proved that a single identity error can cascade. If one record is wrong, others might be too.

Failure mode: After the Jayden name mixup, Keisha audited all 34 student records and found 2 additional minor data mismatches (wrong grade levels). If those had gone to parents, the trust damage would have been unrecoverable.

C015 MEDIUM OBSERVED ONCE 3x efficiency

Never batch-send parent communications. Send one at a time with Keisha reviewing each individually.

Why: Batch sending multiplies errors. One mistake in a batch template affects every family.

Failure mode: Keisha tried batch-sending progress reports on the first biweekly cycle. The template had the wrong date header. All 34 families received reports dated for the wrong week. 12 parents replied asking about the date. Keisha spent 90 minutes sending correction notices.

Candor Labs bronze
C009 HIGH OBSERVED ONCE 5x efficiency

When the code review agent cannot access a PR (private fork, permissions issue, deleted branch), it must report the failure, not skip the PR silently.

Why: A contributor opened a PR from a private fork. The code review agent couldn't access the fork's branch. It silently skipped the PR. The founder assumed "no review comments" meant the PR was clean. He merged it. The PR introduced a dependency with a known CVE. The code review agent would have flagged the dependency if it had been able to read the diff. Silent skip looked identical to clean review.

Failure mode: Access failures produce the same output as "nothing to report." The reviewer cannot distinguish between "reviewed and clean" and "not reviewed."
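
A sketch of a review result that makes "not reviewed" structurally different from "reviewed and clean"; the helper names (fetch_diff, run_checks, AccessError) are hypothetical stand-ins for the agent's real GitHub calls.

```python
from dataclasses import dataclass, field
from typing import Optional

class AccessError(Exception):
    """Raised when the diff cannot be fetched (private fork, permissions, deleted branch)."""

def fetch_diff(pr_number: int) -> str:
    ...  # hypothetical call into the agent's GitHub client

def run_checks(diff: str) -> list[str]:
    return []  # hypothetical review logic

@dataclass
class ReviewResult:
    pr_number: int
    reviewed: bool                      # False means the diff was never inspected
    comments: list[str] = field(default_factory=list)
    failure_reason: Optional[str] = None

def review_pr(pr_number: int) -> ReviewResult:
    try:
        diff = fetch_diff(pr_number)
    except AccessError as exc:
        # Report the access failure explicitly; an empty, clean-looking result is the bug.
        return ReviewResult(pr_number, reviewed=False,
                            failure_reason=f"could not access PR: {exc}")
    return ReviewResult(pr_number, reviewed=True, comments=run_checks(diff))
```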

C010 MEDIUM OBSERVED ONCE 3x efficiency

Linear task creation from agent triage requires the founder's approval. The triage agent drafts Linear tasks; it does not create them.

Why: The triage agent created 23 Linear tasks in its first week from GitHub issues and Slack messages. Seven were duplicates. Four were feature requests the founder had already decided against. Two were from the same user filing multiple reports about expected behavior. The founder spent 45 minutes cleaning up Linear -- longer than manual triage would have taken.

Failure mode: Automated task creation from unfiltered input fills the task system with noise. Cleanup takes longer than manual curation. The task system stops being trustworthy.

C011 MEDIUM MEASURED RESULT 6x efficiency

Agent output quality must be measured against the time the founder saves, not the volume of output produced.

Why: The code review agent produced reviews for 100% of PRs. Impressive. But 70% of PRs were the founder's own code -- code he'd just written and already knew the issues with. The "time saved" on self-authored PRs was near zero. The agent was most valuable on contributor PRs (30% of volume) where the founder hadn't seen the code. We scoped the agent to contributor PRs only and saved the founder 20 minutes/day by eliminating the noise of reviewing his own reviews.

Failure mode: Agents optimize for coverage instead of value. Running on every input (including inputs the human already has context for) creates review overhead that exceeds the review benefit.

C016 LOW OBSERVED ONCE 1.5x efficiency

When the solo founder is unavailable for 24+ hours (vacation, illness), agents must queue output and pause any time-sensitive actions rather than accumulate unreviewed decisions.

Why: The founder took a 3-day weekend without pausing agents. He returned to 47 triaged issues, 12 code reviews, and 3 draft changelogs. The backlog took 2.5 hours to process. Worse, 2 P1 issues had been sitting in triage for 72 hours with users waiting for responses. The agents correctly triaged them as urgent but had no mechanism to escalate when the human wasn't responding. Now agents pause after 24 hours of no human interaction and send a single "review queue paused -- items waiting" notification.

Failure mode: Agents continue producing output when the solo human is unavailable. Backlog accumulates. Time-sensitive items age without escalation. The founder returns to a wall of decisions that should have been made 2 days ago.

C014 HIGH OBSERVED REPEATEDLY 7x efficiency

When an agent produces an output that contains information from the wrong engagement, treat it as a critical incident. Full audit: which agent, which data, how it crossed the boundary, and architectural fix. Not just a correction.

Why: Information barrier breaches in consulting are existential. A pattern of near-misses means the architecture is fundamentally flawed, not that you got unlucky.

Failure mode: After the Haldane/Orion incident, we initially just "corrected the document." The same class of leak happened again 3 weeks later with different clients. Only after treating it as a structural failure and redesigning the agent architecture (splitting Lens and Recon, implementing sequential processing) did the problem stop.

C015 HIGH OBSERVED REPEATEDLY 7x efficiency

Agent-generated content that sounds authoritative but is fabricated (hallucinated frameworks, invented statistics, nonexistent case studies) must be caught before client delivery. Every deliverable draft must be checked against Vault's source library.

Why: Consultants trust agent output more as they get comfortable. The fabrication rate is low enough to create a false sense of reliability but high enough to cause real damage when it slips through.

Failure mode: Beyond the "4D Transformation Framework" incident (C005), Lens cited a "McKinsey 2025 Industry Report" that does not exist in a market analysis. The consultant included it in the deliverable. The client's team tried to find the report and couldn't. Credibility damaged.

C010 HIGH OBSERVED ONCE 5x efficiency

When an agent triggers a member-facing action that results in a complaint, the entire outreach queue for that agent pauses until Jamie reviews and clears it.

Why: One bad message might be a fluke. Two bad messages in a row is a systemic problem. Pausing prevents compounding damage.

Failure mode: Before this rule existed, the retention agent sent 3 incorrect save offers in one week (stale data bug). By the third, Jamie's phone was ringing with upset members. Batch pause would have contained it to one.

C011 MEDIUM OBSERVED ONCE 3x efficiency

Mindbody API failures must be logged and surfaced immediately. Agents must not fall back to cached data for member-facing actions -- they must queue the action for retry.

Why: Mindbody has scheduled maintenance windows and occasional API outages. Agents acting on last-known-good data during outages caused the C002 incident.

Failure mode: During a 3-hour Mindbody outage, the scheduling agent used 6-hour-old data to recommend a class swap. The class had already been manually rescheduled by the location manager during the outage. Conflict created.
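
A sketch of the queue-for-retry rule, assuming a JSONL retry queue and a 15-minute freshness window; both values are illustrative, and perform_action stands in for the real member-facing call.

```python
import json
import time
from pathlib import Path

RETRY_QUEUE = Path("member_action_retry_queue.jsonl")  # hypothetical queue file
MAX_DATA_AGE_SECONDS = 15 * 60                         # assumed freshness window

def perform_action(member_id: str, action: dict) -> None:
    ...  # hypothetical member-facing call (email, schedule change, save offer)

def act_on_member(member_id: str, action: dict, data_timestamp: float) -> None:
    """Queue the action for retry when source data may be stale; never act on cache."""
    if time.time() - data_timestamp > MAX_DATA_AGE_SECONDS:
        with RETRY_QUEUE.open("a") as f:
            f.write(json.dumps({"member_id": member_id, "action": action,
                                "queued_at": time.time()}) + "\n")
        return
    perform_action(member_id, action)
```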

C017 HIGH OBSERVED REPEATEDLY 7x efficiency

Any agent error that reaches a member triggers a post-mortem within 24 hours. The post-mortem must identify root cause, not just symptoms, and produce a rule update.

Why: Without post-mortems, the same class of error repeats with different specifics. The C002 incident could have been prevented if the earlier "We miss you" email error (C001) had produced a proper cross-reference rule.

Failure mode:

DevForge silver
C014 HIGH OBSERVED ONCE 5x efficiency

If any agent output is publicly attributed to AI (by a community member or accidentally), Kai responds honestly within 24 hours with a clear explanation of how he uses AI tools.

Why: Denial makes it worse. The developer community respects transparency and punishes dishonesty.

Failure mode: After the Discord bot detection incident, Kai initially said "I just happened to be up late." Two community members checked his GitHub commit history and showed he had no commits between midnight and 6 AM for the previous 3 months. The contradiction made the situation worse. When Kai finally explained his agent workflow, the community was supportive: "Just be upfront about it next time."

C015 MEDIUM OBSERVED ONCE 3x efficiency

Agent errors on enterprise-facing outputs (release notes, security advisories, support responses) trigger immediate manual review of all pending enterprise communications.

Why: Enterprise customers are 80% of revenue ($6.4K of $8K MRR). One bad enterprise interaction has 40x the revenue impact of one bad community interaction.

Failure mode: The "false breaking change" release note error (C007) triggered 3 enterprise emails. Post-review found that the same release notes draft also understated a real breaking change (listed as "fix" instead of "breaking"). If the enterprise customers had upgraded without realizing it was breaking, it would have caused production incidents for their users.

C016 MEDIUM OBSERVED ONCE 3x efficiency

When the docs generation agent introduces terminology inconsistencies, flag all docs pages using the conflicting term for batch correction. Never correct one page in isolation.

Why: Partial terminology fixes create a docs site where the same concept has two names. This is worse than consistent wrong terminology because users can't search for the right term.

Failure mode: The "middleware hooks" vs. "request interceptors" inconsistency (C006) was initially fixed on only the new page. For 3 weeks, the docs had both terms. A user filed an issue: "Are middleware hooks and request interceptors the same thing? Your docs use both." Kai spent 4 hours auditing every page and standardizing to one term.

C013 MEDIUM OBSERVED ONCE 3x efficiency

Any SOC 2 control deficiency caused by an agent triggers an immediate 72-hour remediation window. The agent is suspended from production until the fix is verified by Maya and the engineering lead.

Why: SOC 2 audit findings compound. One unresolved finding makes auditors scrutinize everything else more aggressively. Fast remediation keeps the audit clean.

Failure mode: The C001 raw transaction incident took 3 weeks to remediate because it wasn't treated as urgent. The auditor noted both the original incident AND the slow remediation as separate findings. Two findings from one incident.

C014 MEDIUM OBSERVED ONCE 3x efficiency

False positive churn predictions that result in user complaints are tracked as a separate metric. If false positive rate exceeds 15% of actioned predictions, the churn model is retrained before any further outreach.

Why: Users who are told "We noticed you haven't been active" when they are active feel surveilled. Each false positive costs more trust than a true positive gains.

Failure mode: See C005. The 8 angry replies from 45 actioned predictions (18% false positive rate) triggered a model retrain. The retrained model incorporated mobile app activity and reduced false positives to 4%.

C015 HIGH OBSERVED ONCE 5x efficiency

When a support ticket auto-response is wrong (user replies saying the automated response didn't help or was incorrect), the ticket is immediately re-routed to a human agent and the auto-response template is flagged for review.

Why: A wrong automated response followed by another wrong automated response makes the user feel trapped in a system that doesn't work.

Failure mode: A user reported a failed Stripe payment. The triage agent auto-responded with "Try reconnecting your bank account via Plaid." The issue was Stripe, not Plaid. The user replied "That's not the problem." The agent sent the same template again. The user tweeted about the experience. 340 impressions.

C014 HIGH OBSERVED ONCE 5x efficiency

The stale-data deadline incident (C002) taught us that any system relying on cached legal dates is a malpractice risk. We now audit the deadline agent weekly by comparing its output against a manual Clio pull. Discrepancy rate must be 0%.

Why: A single missed deadline can result in a malpractice claim that exceeds the case value. The $340K near-miss cost 0 dollars only because a paralegal caught it by coincidence. The expected cost of that failure mode is too high for any tolerance above zero.

Failure mode: Deadline agent reports 42 days remaining. Manual check shows 39 days. Three-day discrepancy on a case worth $340K. If no one catches it, the statute expires. Client sues the firm. Insurance premium increases. Bar complaint filed.
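
A sketch of the weekly audit, assuming both the agent's calendar and the manual Clio pull can be exported as case-ID-to-date mappings; any mismatch or missing case counts as a discrepancy, and the tolerated rate is zero.

```python
def audit_deadlines(agent_deadlines: dict, manual_deadlines: dict) -> list:
    """Compare the agent's deadlines against a manual Clio export.

    Both inputs are assumed to map case IDs to ISO date strings.
    """
    discrepancies = []
    for case_id, manual_date in manual_deadlines.items():
        agent_date = agent_deadlines.get(case_id)
        if agent_date != manual_date:
            discrepancies.append({"case": case_id, "agent": agent_date, "manual": manual_date})
    # Cases the agent tracks but the manual pull does not are also suspect.
    for case_id in agent_deadlines.keys() - manual_deadlines.keys():
        discrepancies.append({"case": case_id, "agent": agent_deadlines[case_id], "manual": None})
    return discrepancies
```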

C015 HIGH OBSERVED ONCE 5x efficiency

We initially gave the demand letter agent access to all 85 active case files so it could learn from prior letters. It began cross-pollinating facts between cases. A draft for one client included medical details from a different client's file. The draft never left the firm, but it exposed a systemic risk.

Why: When an AI has access to multiple case files simultaneously, it can blend facts. In a law firm, blending client facts is a confidentiality violation even if it never leaves the building. Each case must be an isolated context.

Failure mode: Demand letter for Client A includes a medical procedure that happened to Client B. Attorney catches it during review. But if the attorney had been rushing and sent it, opposing counsel would see medical details for a different patient. HIPAA violation, ethics violation, and potential criminal liability.

C016 LOW OBSERVED ONCE 1.5x efficiency

The comms agent sent a scheduling email to a client who had been non-responsive for 60 days. The client had actually retained another firm and not informed us. The scheduling email went to an opposing party's client. The new firm filed a motion alleging improper contact.

Why: Former-client status must be checked before any automated outreach. The comms agent did not verify case status before scheduling. Clio showed the case as "active" because no one had updated it after the client switched firms.

Failure mode: Automated outreach to a former client who is now represented by opposing counsel. Motion for sanctions filed. $4,500 in legal fees to respond. Managing partner's time consumed for two weeks. Reputation damage with the local bar.

C014 HIGH OBSERVED ONCE 5x efficiency

The Fair Housing warning incident cost $2,200 in legal fees, 15 hours of Rachel's time revising processes, and an undetermined amount of reputation damage. The phrase "best views in the city" was 5 words. Total cost-per-word: $440.

Why: Fair Housing compliance is not about intent. Rachel did not intend to mislead. The agent generated language it learned was effective in real estate marketing. But "best" is a subjective superlative that cannot be substantiated. In real estate advertising, unsubstantiated claims are violations regardless of intent.

Failure mode: Without content guardrails, the listing agent optimizes for engagement rather than compliance. Superlatives drive clicks. They also drive complaints. A single complaint triggers a formal investigation that consumes weeks of the broker's time and creates a permanent record.

C015 LOW OBSERVED ONCE 1.5x efficiency

The price recommendation in the seller report (C007) was the most expensive "helpful" suggestion the AI ever made. The seller reduced her price by $14,500 based on an AI recommendation that her listing agent would not have made. The listing agent had planned to recommend staging instead ($2,800 investment), which historically yields a 5-8% ROI in the Denver market.

Why: AI-generated recommendations carry perceived authority because they appear in an "official" report. Sellers treat them as data-driven conclusions, not suggestions. The listing agent's relationship-based advice gets overridden by a number in a report.

Failure mode: AI recommendation undermines agent strategy. Agent loses control of the pricing conversation. Seller follows the report instead of the agent. If the recommendation is wrong, the seller blames the brokerage. If it is right, the seller credits the AI and questions whether they need an agent.

C016 LOW OBSERVED ONCE 1.5x efficiency

The showing scheduler initially optimized for maximum showings per day without considering showing fatigue. It scheduled 9 showings in one day for a buyer. By the sixth showing, the buyer was overwhelmed and could not differentiate the properties. The next day she could not remember which house had the updated kitchen.

Why: More showings is not better showings. Optimal showing count per session is 4-5 properties with a break in between. Above 6, buyers experience decision fatigue and either choose impulsively or delay choosing entirely.

Failure mode: Scheduler optimizes for throughput. Buyer sees 9 homes in one day. Cannot remember any of them clearly. Requests second showings on 4 properties. Four repeat showings that were avoidable. Agent time wasted. Sellers inconvenienced. Buyer frustrated.

KGORG Founding silver
C015 MEDIUM OBSERVED ONCE 3x efficiency

Skill and seat alignment can fail operationally even when the intended architecture is clear, so actual platform state must be verified after assignment attempts.

Why: Prior interaction history shows a failed skill assignment attempt involving Sophie and the Email and Calendar Ops skill before the configuration was confirmed.

Failure mode: The organization may believe a safety or procedure layer is active when it is not, leading to silent capability gaps and misleading assumptions about agent behavior.

C016 MEDIUM INFERENCE 2x efficiency

Duplicate or repeated lesson memories should be treated as a signal of memory hygiene issues and reviewed periodically.

Why: The current memory set includes repeated lessons about the user's formatting preferences and design preferences.

Failure mode: Memory duplication can clutter context, waste tokens, and make it harder to distinguish genuinely new learning from repeated storage artifacts.

C017 HIGH MEASURED RESULT 10x efficiency

Provider instability should be watched even when circuit breakers are closed, because a non-zero failure history can still indicate integration fragility.

Why: The current circuit breaker snapshot shows OpenAI closed but with recorded failures, which suggests past provider errors did occur.

Failure mode: If transient provider issues are ignored, troubleshooting starts too late and agent reliability may degrade unexpectedly under load or during critical workflows.

C014 HIGH OBSERVED REPEATEDLY 7x efficiency

Any PHI exposure incident -- even if caught before external disclosure -- must be documented, root-cause analyzed, and the architectural control that failed must be identified and fixed within 48 hours. PHI near-misses are treated with the same severity as actual breaches for internal process purposes.

Why: HIPAA enforcement trends show that OCR (Office for Civil Rights) increasingly evaluates systemic compliance, not just incident response. A practice that can demonstrate a near-miss program with root cause analysis and architectural fixes is in a stronger compliance position than one that only responds to actual breaches.

Failure mode: The first three months of operation produced 4 near-misses (C001, C002, C005, C006). Each was treated as a one-off correction. After implementing the near-miss severity protocol, the architectural redesign (C002) was fast-tracked and eliminated the root cause for all 4 categories of near-miss. Zero near-misses in the subsequent 8 months.

C015 HIGH OBSERVED ONCE 5x efficiency

When a new staff member joins and is trained on agent usage, they must complete a 30-minute HIPAA-and-agents training that covers: what PHI is, how agents work, why PHI must never enter a prompt, and how to report a suspected exposure. New staff are the highest-risk vector for PHI entering agent prompts.

Why: Clinical staff who are new to AI agents don't intuitively understand that typing a patient name into a prompt is different from writing it in a chart. The mental model of "the computer knows how to keep things private" doesn't apply to LLM-based agents.

Failure mode: A new front desk hire asked Shield a question that included a patient's full name and insurance member ID: "Can you check if John Smith, member ID BXC-445821, needs re-authorization?" Shield processed the request (it had no mechanism to reject PHI). The query and response were logged. The log now contained PHI. The practice's HIPAA compliance officer identified the log entry in the monthly audit. The log was purged, the employee was retrained, and the input validation was strengthened to reject patterns matching common PHI formats.
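
A sketch of the strengthened input validation, assuming regex patterns for a few common PHI formats (the member-ID shape mirrors the one in the incident); a real deployment would need a reviewed, much broader pattern set.

```python
import re

# Assumed patterns for common PHI formats -- illustrative, not exhaustive.
PHI_PATTERNS = [
    re.compile(r"\b[A-Z]{2,4}-\d{6,}\b"),            # insurance member IDs like BXC-445821
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-shaped numbers
    re.compile(r"\b(?:DOB|date of birth)\b", re.I),  # date-of-birth mentions
]

def reject_if_phi(prompt: str) -> str:
    """Reject prompts that appear to contain PHI before they reach the model or the logs."""
    for pattern in PHI_PATTERNS:
        if pattern.search(prompt):
            raise ValueError("Prompt rejected: possible PHI detected. "
                             "Remove patient identifiers and retry.")
    return prompt
```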

C014 HIGH OBSERVED REPEATEDLY 7x efficiency

The $450 dripping faucet dispatch was the single incident that made Mark question the entire AI investment. Total AI implementation cost at that point: $1,200. The single misclassification represented 37.5% of the total investment. In a 120-unit operation with thin margins, one bad dispatch erodes confidence faster than 50 correct triages build it.

Why: Property management operates on 8-12% margins. Mark's net operating income on $1.94M gross is approximately $194K. A $450 unnecessary expense is 0.23% of annual NOI. Four false emergencies ($1,680) is 0.87%. At scale, misclassification is a material cost.

Failure mode: Confidence in the triage agent drops after one visible mistake. Corinne starts manually reviewing every triage decision, eliminating the time savings that justified the agent. The agent becomes overhead rather than productivity gain. Mark considers shutting down the entire AI system.

C015 LOW OBSERVED ONCE 1.5x efficiency

The comms agent's promise of a Thursday repair (C004) caused a cascading trust failure: tenant lost a vacation day, filed a regulatory complaint, and left at lease end. The total cost of one broken promise: $3,200 in turnover plus 3 hours of Mark's time on paperwork plus permanent regulatory file entry. The repair itself was $85.

Why: In property management, trust is the product. Tenants do not stay because the building is perfect. They stay because they trust management to be honest and responsive. One broken promise breaks more trust than 10 completed repairs build.

Failure mode: Comms agent makes a commitment it cannot keep. Tenant structures their life around the commitment (takes off work, rearranges schedule). Commitment broken. Trust destroyed. Tenant leaves. Turnover is the most expensive event in property management.

Learnwell silver
C015 HIGH OBSERVED REPEATEDLY 7x efficiency

After any content error reaches a user, conduct a post-mortem within 48 hours. Document: what failed, why QA missed it, what changes prevent recurrence. Store post-mortems in a shared Google Doc.

Why: The Emancipation Proclamation incident had no post-mortem for 2 weeks. In that time, the same verification gap (single-source Wikipedia check) was used on 8 more study guides.

Failure mode: Without immediate post-mortems, the same failure pattern repeated 3 times in 6 weeks: content QA using secondary sources, no human review, and publication during a rush period. Each incident was smaller than the first, but the cumulative effect was a reputation as "the platform that gets things wrong."

C016 MEDIUM OBSERVED ONCE 3x efficiency

If any agent output becomes publicly visible (screenshot, social media post, review site), treat it as a P0 incident regardless of whether the content is correct.

Why: Public visibility changes the stakes. Even correct content, if it looks automated or impersonal, can damage the brand.

Failure mode: A teacher screenshotted a perfectly accurate but robotic-sounding outreach email and posted it in a teacher Facebook group with "Is Learnwell using AI to email us now?" The email was factually correct but the public framing turned it into a trust issue. 3 teachers in the thread cancelled their accounts. Content was right; the tone was the failure.

McFadyen Digital Founding silver
C016 HIGH OBSERVED ONCE 5x efficiency

AI agents must never autonomously communicate with clients, sellers, or external stakeholders. All external communication flows through a named human.

Why: A test deployment of the Marketplace Analyst accidentally sent a seller health alert directly to a marketplace operator's Slack channel (misconfigured webhook). The client saw raw internal scoring data including "churn risk: high" for three of their top sellers. The account team spent two weeks in damage control.

Failure mode: Client trust erosion, internal data exposure, potential contract violation. The incident led to our blanket rule: AI agents have zero external communication authority.

C017 HIGH OBSERVED REPEATEDLY 7x efficiency

Platform-specific AI models must be retrained or re-validated within 30 days of any major platform release (Adobe Commerce, commercetools, VTEX, Mirakl).

Why: Our Code Review Assistant was trained on Adobe Commerce 2.4.5 patterns. When 2.4.6 shipped with breaking changes to the checkout API, the assistant continued approving code written against the old patterns. Three PRs shipped to staging with deprecated method calls before a senior dev flagged it.

Failure mode: Stale AI models approve code against deprecated platform APIs, introducing technical debt and potential runtime failures in client environments.

C018 MEDIUM HUMAN DEFINED RULE 3x efficiency

When two AI agents produce conflicting recommendations (e.g., Sales Agent scores a lead as high-priority while the Proposal Engine flags scope concerns), the conflict must surface to a human decision-maker within 4 hours. Neither agent may override the other.

Why: Early in our rollout, the Sales Agent pushed a lead through as "high-fit" while the Proposal Engine flagged the engagement as requiring capabilities we had never delivered (custom blockchain-based marketplace settlement). The agents operated in parallel without conflict detection. The SA discovered the mismatch only after spending a day on a proposal we should have declined.

Failure mode: Wasted senior consultant time, potential over-commitment to engagements outside our capability, and reputational risk if we win work we cannot deliver.

C009 HIGH OBSERVED REPEATEDLY 7x efficiency

API token expiration must be monitored with a dedicated health check, not discovered when an agent fails.

Why: We lost 3 days of Meta Ads data because the token expired over a weekend. No agent checks for "am I authenticated?" before attempting work -- they just fail silently and write nothing to their shared state file. The briefing agent saw an empty file and reported "no alerts" instead of "data unavailable."

Failure mode: Silent authentication failure looks like "everything is fine" instead of "system is blind."
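
A sketch of a dedicated token health check that runs on a schedule rather than waiting for an agent to fail; the 7-day warning window is an assumption, and token_expiry is assumed to be a timezone-aware datetime.

```python
import datetime

EXPIRY_WARNING_DAYS = 7  # assumed warning window

def check_token_health(token_expiry: datetime.datetime, platform: str) -> dict:
    """Report token status before any agent depends on it, instead of after a silent failure."""
    now = datetime.datetime.now(datetime.timezone.utc)
    remaining = token_expiry - now
    if remaining.total_seconds() <= 0:
        return {"platform": platform, "status": "EXPIRED"}
    if remaining < datetime.timedelta(days=EXPIRY_WARNING_DAYS):
        return {"platform": platform, "status": "EXPIRING_SOON", "days_left": remaining.days}
    return {"platform": platform, "status": "OK", "days_left": remaining.days}
```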

C010 HIGH OBSERVED REPEATEDLY 7x efficiency

When an agent produces zero output for a data source that always has data, treat it as a system failure, not a clean bill of health.

Why: See C009. The briefing interpreted "no Meta alerts" as "Meta is healthy" when in reality the monitoring agent couldn't authenticate. For 3 days, the team believed Meta campaigns were running perfectly while CPL on two accounts had doubled.

Failure mode: Zero-output is misread as zero-problems. The absence of data is treated as the absence of issues.

C017 MEDIUM MEASURED RESULT 6x efficiency

When scaling agent count, add one agent at a time with a 2-week stabilization window between additions.

Why: We added agents 5, 6, 7, and 8 in the same week. Within 3 days, agents 6 and 7 had overlapping responsibilities that nobody caught during design. Both were monitoring Slack for client mentions -- one for the briefing, one for escalation alerts. Client mentions were being processed twice, once surfacing as "mention in briefing" and once as "possible escalation." The founder saw the same client name in two different sections and assumed two separate issues existed.

Failure mode: Rapid parallel deployment masks responsibility overlap. Debugging which agent owns what becomes exponentially harder with each simultaneous addition.

C018 MEDIUM SPECULATION 1x efficiency

Skipping stakeholder analysis (Step 2) produces systems that solve the wrong problem for the wrong people.

Why: AI generates convincingly detailed stakeholder analyses from minimal input. If the human does not verify stakeholder identification against reality, the entire project builds on a plausible but incorrect foundation.

Failure mode: AI identifies four stakeholders. A fifth stakeholder (the IT administrator responsible for deployment) is missed. The system has no deployment documentation, no admin interface, and no monitoring. IT blocks the rollout.

C019 HIGH HUMAN DEFINED RULE 5x efficiency

Skipping review gates produces artifacts that look complete but contain undetected errors that compound through subsequent steps.

Why: AI generates coherent output even when the underlying logic is flawed. Without human review, errors pass through as authoritative. Each subsequent step builds on the flawed artifact, amplifying the error.

Failure mode: AI generates a PRD at Step 3 with an ambiguous requirement. No review catches it. Steps 5-10 interpret the ambiguity differently. Implementation contains contradictory behaviors. Discovered in user acceptance testing.

C020 MEDIUM INFERENCE 2x efficiency

Treating AI output as authoritative without review produces confirmation bias at scale. AI generates what it predicts you want to see.

Why: AI language models generate plausible, coherent text. Plausibility is not correctness. Without human scrutiny, teams accept AI-generated analyses, evaluations, and validations because they read well, not because they are right.

Failure mode: AI generates a "comprehensive evaluation" at Step 8 that confirms everything is on track. The evaluation reads convincingly. The team proceeds. A critical assumption (market timing) is wrong. The project launches into a market that has shifted.

C021 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Attempting to satisfy all stakeholders simultaneously in prototyping produces bloated, compromised designs that satisfy none.

Why: Each stakeholder has different priorities, workflows, and UI preferences. AI averages across stakeholders when given all requirements simultaneously. The averaged output is mediocre for everyone.

Failure mode: AI generates a single prototype serving five stakeholder groups. The UI is crowded with features. No stakeholder can find their primary workflow. All stakeholders request changes. The prototype is scrapped.

C022 LOW SPECULATION 0.5x efficiency

Investing emotional attachment in AI-generated code prevents honest evaluation and necessary pivots.

Why: Even though AI generates code in minutes, humans form attachment to artifacts they have reviewed, refined, and discussed. The sunk cost fallacy applies to attention invested, not just time invested.

Failure mode: Team refines an AI-generated data model over two sessions. New discovery in Step 7 invalidates the model. Team patches the model instead of regenerating from scratch. The patches introduce complexity that degrades the system for its entire lifetime.

C023 LOW SPECULATION 0.5x efficiency

Proceeding past Step 9 (Business Purpose Validation) without clear pass criteria converts validation into a formality that catches nothing.

Why: Step 9 is the final gate before implementation commitment. If pass criteria are vague ("users like it"), the gate provides false assurance. If pass criteria are specific and measurable ("conversion rate exceeds 3% in pilot"), the gate is meaningful.

Failure mode: Team defines success as "positive stakeholder feedback." Stakeholders provide positive feedback because the prototype is shiny. The business purpose (reduce support tickets by 40%) is never tested. Support tickets increase post-launch.

C024 LOW SPECULATION 0.5x efficiency

Using AI to generate the validation criteria for its own output creates a closed loop that cannot detect its own failures.

Why: AI optimizes for coherence. If it generates both the artifact and the test for the artifact, the test will be structurally aligned with the artifact's assumptions. The test passes because it shares the artifact's blind spots.

Failure mode: AI generates a data model and also generates the validation tests for that model. The tests check structural integrity but not domain correctness. The model is structurally sound but misses a business rule. Tests pass. Business rule fails in production.

C013 HIGH OBSERVED REPEATEDLY 7x efficiency

When the founder corrects an agent's output, the correction must be categorized: FACTUAL (wrong data), TONE (wrong voice), STRUCTURAL (wrong format), or STRATEGIC (wrong conclusion). Track correction categories monthly to identify systemic patterns.

Why: Isolated corrections are noise. Patterns are signal. If 80% of corrections are TONE, the solution is a better banned phrases list, not better data sourcing. Without categorization, the founder fixes symptoms instead of causes.

Failure mode: The founder was making 4-6 corrections per deliverable for 3 months. Each correction felt like a one-off. When corrections were finally categorized, 70% were TONE (consultant-speak). A single update to the banned phrases list dropped corrections to 1-2 per deliverable. Three months of unnecessary rework because nobody tracked the pattern.
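
A sketch of the correction-categorization tracking, using the four categories named in the claim; the in-memory list and monthly tally are illustrative, not the founder's actual tooling.

```python
from collections import Counter
from enum import Enum

class CorrectionCategory(Enum):
    FACTUAL = "wrong data"
    TONE = "wrong voice"
    STRUCTURAL = "wrong format"
    STRATEGIC = "wrong conclusion"

corrections: list[tuple[str, CorrectionCategory]] = []  # (deliverable_id, category)

def record_correction(deliverable_id: str, category: CorrectionCategory) -> None:
    corrections.append((deliverable_id, category))

def monthly_pattern() -> dict:
    """Share of corrections per category; a dominant category points to a systemic fix."""
    counts = Counter(category for _, category in corrections)
    total = sum(counts.values()) or 1
    return {category.name: round(count / total, 2) for category, count in counts.items()}
```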

C014 HIGH OBSERVED ONCE 5x efficiency

Agents must never generate content that the founder cannot verify. If Scout cites a statistic, the source must be provided. If Forge includes a market figure, the origin must be traceable. Unverifiable claims are worse than no claims.

Why: The founder stands behind every number in every deliverable. When challenged in a meeting, "I'll have to check where that came from" is an unacceptable answer. The source must be immediately accessible.

Failure mode: Forge included a claim that "73% of healthcare organizations plan to increase AI investment in 2026." No source. The founder used it in a client presentation. When asked for the source, the founder couldn't find it. The number was hallucinated by the model -- no such survey exists. The client's research team confirmed it wasn't real. The founder's credibility as a data-driven strategist took a direct hit.

C014 MEDIUM OBSERVED ONCE 3x efficiency

All three agents were activated from day one. Only Protocol Steward had meaningful work; the others generated noise.

Why: Agents without data produce low-value output.

Failure mode: Founder reads noise. Loses trust. Stops reading agent outputs.

C015 LOW SPECULATION 0.5x efficiency

Daily agent review consumed build time. Weekly batching loses nothing.

Why: Daily reviews felt productive but were not.

Failure mode: 20-35% of OTP time spent on review instead of building.

C016 LOW MEASURED RESULT 3x efficiency

Designed a 14-agent architecture before shipping any code. Only 3 are needed now. Planning addiction.

Why: Designing agents is enjoyable. Building the platform is hard.

Failure mode: 170 vault files. Zero production code.

C013 HIGH OBSERVED ONCE 5x efficiency

Any incident where agent output negatively impacts the creative team's morale or autonomy triggers a 1-week agent pause for the offending agent. During the pause, Mara, Diego, and the affected team member review the agent's scope and boundaries.

Why: The month-3 crisis (C002) nearly killed the entire agent program. Mara's lesson: agent efficiency gains that come at the cost of creative team morale produce net negative outcomes. A demoralized designer produces worse work, and replacing Kai or Nina would take 6 months and cost $40K+ in recruiting.

Failure mode: The crisis itself is the failure mode. Two senior designers threatened to quit. The 2-week pause and redesign cost $8K in delayed project timelines. But it saved the team and established the fundamental principle: agents serve the creatives.

C014 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Scope creep detected by the timeline agent is flagged within 4 hours of the client request. The flag includes: estimated additional hours, margin impact on the current project, and a draft change order for Diego to review.

Why: See C008. Scope creep compounds. A single "Can you also..." is manageable. Three untracked "Can you also..." requests on the same project can turn a profitable project into a loss.

Failure mode: Over 6 months, the timeline agent tracked that 73% of projects experienced at least one scope expansion request. Of those, only 40% resulted in a change order before the agent's flagging system. After implementation, change order rate on scope expansions rose to 85%.

C015 LOW OBSERVED ONCE 1.5x efficiency

When the proposal agent loses a pitch (client declines the proposal), the loss is logged with the client's stated reason (if available). After 5 losses, the agent reviews the pattern and recommends adjustments to Mara.

Why: Small agencies can't afford to lose pitches at random. Patterns in losses reveal pricing issues, positioning gaps, or process problems.

Failure mode: After 8 months, the loss analysis revealed that proposals over $25K had a 20% close rate while proposals under $15K closed at 65%. Mara was pricing correctly but targeting the wrong segment for large projects. She adjusted her positioning for larger pitches and close rate improved to 35% within 2 months.

R3V Founding gold
C010 HIGH OBSERVED ONCE 5x efficiency

One recurrent failure pattern is governance mismatch: an agent may have the right tools assigned but still be blocked by seat-level permissions.

Why: Prior org learning explicitly records that seat governance can block tools even when tool assignment is correct, and that the fix may be simplifying allowedActions while relying on allowedTools.

Failure mode: The agent appears misconfigured or broken, but the real issue is cross-layer permission conflict. This wastes debugging time and can stall production rollout.

C011 MEDIUM OBSERVED ONCE 3x efficiency

Another failure pattern is integration implementation drift: custom/manual tools can fail when they do not follow the platform's proven credential and fetch patterns.

Why: An org lesson notes that working manual GHL tools should read MCP server credentials directly and call the REST API in a known-good pattern rather than attempting unsupported invocation patterns.

Failure mode: Tool handlers compile but fail at runtime, causing agent runs to misfire or produce incomplete outputs during important workflows.

C012 MEDIUM OBSERVED REPEATEDLY 4x efficiency

The organization experiments in production-adjacent environments, which creates a deliberate but real risk of stale draft artifacts and temporary pilot agents lingering longer than intended.

Why: The current org includes multiple draft agents, draft tools, many draft skills, and pilot variants like Sage 3 and Lead-Appointment Specialist 2 described as test-only and intended for later cleanup.

Failure mode: Draft or pilot artifacts can confuse operators, muddy release readiness, and increase the chance that the wrong component gets referenced or promoted.

C009 HIGH OBSERVED REPEATEDLY 7x efficiency

When Google Ads API returns an error or timeout for a specific account, the agent retries once after 60 seconds. If the retry fails, it writes "ACCOUNT_UNAVAILABLE" to the shared state with the timestamp. It does not skip the account silently.

Why: The ad monitor had a try/catch that swallowed API errors and continued to the next account. The shared state file looked complete -- it had entries for all 12 clients. But 2 entries were stale copies from yesterday's data because the error handler wrote the previous values as fallback. The founder didn't know he was looking at yesterday's numbers for 2 accounts.

Failure mode: Silent error handling with fallback-to-stale produces state files that look complete but contain outdated data for specific accounts.
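
A sketch of the retry-then-record rule; fetch_account_metrics is a hypothetical stand-in for the Google Ads API call, and the state dict stands in for the shared state file.

```python
import time
from datetime import datetime, timezone

def fetch_account_metrics(account_id: str) -> dict:
    ...  # hypothetical Google Ads API call

def monitor_account(account_id: str, state: dict) -> None:
    """Retry once after 60 seconds; on a second failure, record ACCOUNT_UNAVAILABLE.

    Never fall back to yesterday's values and never skip the account silently.
    """
    for attempt in range(2):
        try:
            state[account_id] = fetch_account_metrics(account_id)
            return
        except Exception:
            if attempt == 0:
                time.sleep(60)
    state[account_id] = {"status": "ACCOUNT_UNAVAILABLE",
                         "timestamp": datetime.now(timezone.utc).isoformat()}
```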

C010 MEDIUM OBSERVED ONCE 3x efficiency

Never use display names for client matching across systems. Use account IDs.

Why: We onboarded "Smith & Sons Roofing" and "Smith's Roofing" in the same month. The weekly report agent matched both to a single "Smith" entry in the CRM using fuzzy name matching. The combined report showed $11,200 in spend when Smith & Sons was at $7,800 and Smith's Roofing was at $3,400. The founder quoted the wrong number on a client call.

Failure mode: Fuzzy name matching merges distinct clients with similar names. Merged data is presented as a single entity. Client-facing communications cite wrong numbers.

C015 LOW INFERENCE 0.8x efficiency

When an agent cannot complete its task, it must write a failure entry to its shared state file explaining what failed and when. An empty or missing file is never acceptable.

Why: The campaign audit agent hit a rate limit and crashed without writing anything. Its shared state file was empty. The briefing agent skipped the audit section entirely -- no mention that it was missing. The founder assumed the audit ran clean. It hadn't run at all.

Failure mode: Missing output is indistinguishable from "nothing to report." Humans assume silence is health.

Sneeze It Founding gold
C014 HIGH OBSERVED REPEATEDLY 7x efficiency

We scaled from 2 agents to 8 in 6 weeks. Three of those agents had overlapping responsibilities that we did not discover until month 3. The fix took longer than the original build of all three agents combined.

Why: Rapid scaling without explicit authority documentation creates hidden overlaps. Each agent worked perfectly fine in isolation. The conflicts only became visible when their outputs were compared side by side in the morning briefing.

Failure mode: Reporting agent and ops agent both independently track project deadlines using different data sources. Briefing shows two different due dates for the same client project. Nobody knows which one is correct.

C015 HIGH OBSERVED ONCE 5x efficiency

We let the EA agent send "quick acknowledgment" emails to clients without human review. It acknowledged a client complaint with "Thanks for letting us know!" without addressing the substance of their concerns. Client escalated directly to the founder.

Why: Even simple acknowledgments carry emotional tone. "Thanks for letting us know" sent to a frustrated client reads as dismissive and uncaring. The AI did not detect the emotional register of the incoming message.

Failure mode: Client sends an angry email about declining results. EA auto-acknowledges with a cheerful tone. Client interprets it as corporate indifference. Relationship severely damaged. Takes two in-person meetings to repair.

C016 HIGH MEASURED RESULT 10x efficiency

We gave the spend monitor a flat $50 threshold for alerts. It generated 40+ alerts per day across the portfolio. We raised it to $200. Then we missed a real overspend of $180 on a small account. The right threshold was percentage-based (15% over daily budget), not dollar-based.

Why: Dollar thresholds do not scale across accounts of vastly different sizes. $50 is meaningless noise on a $5,000/day account but represents a 90% overspend on a $200/day account. Percentage normalizes the signal across the entire portfolio.

Failure mode: Small account overspends by $180 per day (90% over budget) for 6 days. Alert suppressed because it falls under the $200 dollar threshold. Month-end reconciliation reveals $1,080 in unplanned overspend. Client is not happy.
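
A sketch of the percentage-based threshold, using the 15% figure from the claim; the asserts replay the two cases described (a 90% overspend on a $200/day budget alerts, a $40 blip on a $5,000/day budget does not).

```python
def overspend_alert(spend_today: float, daily_budget: float,
                    threshold_pct: float = 0.15) -> bool:
    """Alert when spend exceeds budget by a percentage, not a flat dollar amount.

    A flat threshold is noise on a large account and misses a 90% overspend
    on a small one; a percentage scales across the whole portfolio.
    """
    if daily_budget <= 0:
        return spend_today > 0  # any spend on a zero-budget account is an alert
    return (spend_today - daily_budget) / daily_budget > threshold_pct

# $180 over a $200/day budget is a 90% overspend -> alert.
assert overspend_alert(380, 200) is True
# $40 over a $5,000/day budget is noise -> no alert.
assert overspend_alert(5040, 5000) is False
```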

C017 HIGH OBSERVED ONCE 5x efficiency

We gave the performance analyst write access to campaign settings. It optimized for metrics the client did not care about.

Why: The analyst lacked client context. Its optimization targets were technically correct but strategically wrong.

Failure mode: Analyst decreased ad spend on a campaign the client considered strategic (brand building, not performance). Client was frustrated by the uninstructed change.

C018 HIGH MEASURED RESULT 10x efficiency

We used a single shared state file for all agents. It became a bottleneck and a source of merge conflicts within the first week.

Why: A single file means every agent update blocks every other agent. Concurrent writes caused data corruption.

Failure mode: Two agents wrote to the shared file simultaneously. One update was lost. State became inconsistent. Required manual cleanup.
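
One way to remove the shared-file bottleneck, sketched as per-agent state files written atomically; the directory name is illustrative. The write-temp-then-rename pattern means a reader never sees a half-written file, and no agent ever writes another agent's file.

```python
import json
import os
import tempfile
from pathlib import Path

STATE_DIR = Path("agent_state")  # hypothetical directory, one file per agent

def write_state(agent_name: str, state: dict) -> None:
    """Write this agent's state atomically to its own file."""
    STATE_DIR.mkdir(exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=STATE_DIR)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, STATE_DIR / f"{agent_name}.json")  # atomic rename

def read_all_state() -> dict:
    """A coordinator reads every agent's file; no agent writes another agent's."""
    return {p.stem: json.loads(p.read_text()) for p in STATE_DIR.glob("*.json")}
```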

C019 HIGH MEASURED RESULT 10x efficiency

We built coordination infrastructure (message bus, task queue) without embedding triggers in agent workflows. Result: zero transactions for 2 weeks despite live infrastructure. Only activated after explicitly wiring 3 agent workflows to read and write to inboxes.

Why: The protocol described how agents should communicate. No agent's workflow actually included a step to read or write to the message bus. Infrastructure without workflow integration is dead plumbing. The fix was embedding inbox checks into the daily run sequence of each participating agent.

Failure mode: 13 inbox files deployed. All empty for 14 days. All agents operating through the old shared state pattern. Infrastructure investment wasted until triggers are embedded in the agent's actual execution path, not just documented in a spec.

C020 HIGH MEASURED RESULT 10x efficiency

We specified an escalation action in the protocol. The agent detected the trigger. The agent reported the action was overdue. The agent never executed the action. For 17 days.

Why: The spec was treated as documentation, not executable logic. The agent could describe what should happen without having the tools, permissions, or branching logic to do it.

Failure mode: Critical ad overspend detected. Escalation specified. Agent reports "escalation overdue" for 17 days. No DM sent. No escalation executed. Specification-execution gap.

C021 MEDIUM MEASURED RESULT 6x efficiency

Negative constraints (banned phrases, guardrails) improve AI-drafted message quality. Structural requirements (frameworks, examples, forced elements) degrade it.

Why: Telling an AI what NOT to do produces natural variation. Telling it exactly what TO do produces formulaic output that humans detect and distrust.

Failure mode: Added example messages to coaching prompts. Quality score dropped from 8.4 to 8.2. Reverted. Added zero-tolerance accountability rules instead. Score rose to 8.8.

Stackwise silver
C013 HIGH MEASURED RESULT 10x efficiency

The billing agent auto-applied a $2,400 credit based on a misinterpreted ticket. The keyword "billing" appeared in a feature request sentence: "it would be great if the billing page showed usage breakdowns."

Why: The rule was too broad: "If customer mentions billing problem, check account and apply credit." The feature request contained the word "billing." It was not a complaint.

Failure mode: Agent reads "billing" keyword. Triggers credit workflow. Auto-applies credit without context check. Discovered 3 weeks later.

C014 HIGH OBSERVED ONCE 5x efficiency

Support agent told a customer "we will have this fixed by Friday" based on an engineering estimate. Engineering shipped the following Tuesday. Customer followed up expecting Friday delivery.

Why: Agent read "targeting Friday" in a GitHub issue as a commitment. Estimates are not commitments. Agent should never communicate timelines without approval.

Failure mode: Agent promises delivery based on internal estimate. Engineering misses estimate. Customer expects fix. Trust eroded. Three follow-up emails.

C013 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Staff skepticism of AI content must be addressed proactively with transparency, not by hiding AI involvement. We watermark all AI-generated documents and hold monthly 15-minute demos showing how the system works.

Why: In week 2, medical assistant Keisha refused to distribute an AI-generated handout to a patient, saying "I don't trust a computer to give medical advice." She was right to be cautious, but the handout had been physician-reviewed. The issue was that she did not know about the review step.

Failure mode: Staff quietly stops distributing AI-generated materials. Education content sits in the queue unused. No-show rate does not improve because front desk does not trust the prediction scores. Six weeks of implementation effort produces zero measurable results.

C014 HIGH MEASURED RESULT 10x efficiency

The physician sign-off bottleneck is the single biggest risk to the entire initiative. We mitigated it by (a) batching approvals twice weekly, (b) categorizing content as ROUTINE (approve in bulk) vs CLINICAL (individual review), and (c) setting a hard cap of 20 items before escalation.

Why: The 47-item backlog in week 3 nearly killed the project. Dr. Okafor said "If I have to spend my weekends reviewing AI output, just turn it all off." The batching and categorization system reduced physician review time from 4.5 hours per week to 1.5 hours.

Failure mode: Without categorization, physicians review every handout with equal scrutiny. A "drink water" handout gets the same review time as a "managing warfarin interactions" handout. Physicians burn out on low-value reviews and stop reviewing entirely.
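
A minimal sketch of the batching logic, assuming content items arrive pre-labeled as ROUTINE or CLINICAL; names and structure are illustrative:

```python
from dataclasses import dataclass

REVIEW_QUEUE_CAP = 20  # hard cap before escalation, per the claim above

@dataclass
class ContentItem:
    title: str
    category: str  # "ROUTINE" or "CLINICAL"

def build_review_batch(queue: list[ContentItem]) -> dict[str, list[ContentItem]]:
    """Split a twice-weekly batch into bulk-approve and individual-review piles."""
    if len(queue) > REVIEW_QUEUE_CAP:
        raise RuntimeError(
            f"{len(queue)} items queued (cap {REVIEW_QUEUE_CAP}); "
            "escalate before generating more content."
        )
    return {
        "bulk_approve": [i for i in queue if i.category == "ROUTINE"],
        "individual_review": [i for i in queue if i.category == "CLINICAL"],
    }
```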

C015 LOW OBSERVED ONCE 1.5x efficiency

The no-show prediction model carried a racial bias from its initial training data: our historical no-show data correlated with zip codes that map onto demographic patterns. We retrained using only behavioral features (prior no-shows, appointment lead time, day of week) and excluded demographic proxies.

Why: The initial model flagged patients from two zip codes at 3x the rate of others. Tanya noticed the pattern during week 2. Those zip codes correspond to predominantly Black neighborhoods. Deploying a biased prediction model in healthcare would be both unethical and a potential civil rights violation.

Failure mode: Biased model deployed without audit. Front desk unconsciously treats flagged patients differently. Pattern becomes self-reinforcing. Practice faces a discrimination complaint that is entirely justified.
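
A sketch of the feature-selection step, with hypothetical column names; the allowlist mirrors the behavioral features named above, and zip code (or any other proxy) never reaches training:

```python
# Allowlist of behavioral features; anything not listed -- including zip code
# and other demographic proxies -- is dropped before training.
BEHAVIORAL_FEATURES = ["prior_no_shows", "appointment_lead_time_days", "day_of_week"]
EXCLUDED_PROXIES = {"zip_code"}  # extend with any other known proxies

def select_training_features(rows: list[dict]) -> list[dict]:
    """Project each appointment record onto the behavioral allowlist."""
    assert not set(BEHAVIORAL_FEATURES) & EXCLUDED_PROXIES, "proxy leaked into allowlist"
    return [{k: row[k] for k in BEHAVIORAL_FEATURES} for row in rows]

sample = [{"prior_no_shows": 2, "appointment_lead_time_days": 14,
           "day_of_week": "Mon", "zip_code": "60629"}]
print(select_training_features(sample))  # zip_code is dropped
```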

C016 HIGH OBSERVED ONCE 5x efficiency

Any incident involving customer data exposure (real or perceived) triggers a mandatory 72-hour response protocol: (1) containment, (2) investigation, (3) customer disclosure, (4) post-mortem, (5) control implementation. No shortcuts.

Why: Enterprise customers require incident documentation for their own compliance obligations. Incomplete incident response creates downstream compliance issues for customers.

Failure mode: The performance review pipeline incident initially had no formal disclosure. Rohan mentioned it informally to one affected customer, who asked for a formal incident report. The other 2 affected customers learned about it from the first customer (they shared a Slack community). Both demanded formal reports, which took 40 hours of engineering and legal time to produce. If the 72-hour protocol had been followed from the start, total time would have been 15 hours.
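
A sketch of tracking the five steps against the 72-hour clock. Splitting the window evenly across steps is an assumption for illustration, not a documented rule:

```python
from datetime import datetime, timedelta, timezone

PROTOCOL_STEPS = [
    "containment", "investigation", "customer_disclosure",
    "post_mortem", "control_implementation",
]

def step_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Spread the five mandatory steps across the 72-hour window as checkpoints."""
    window = timedelta(hours=72)
    return {
        step: detected_at + window * (i + 1) / len(PROTOCOL_STEPS)
        for i, step in enumerate(PROTOCOL_STEPS)
    }

def overdue_steps(detected_at: datetime, completed: set[str],
                  now: datetime | None = None) -> list[str]:
    """List every step whose checkpoint has passed without completion."""
    now = now or datetime.now(timezone.utc)
    return [step for step, due in step_deadlines(detected_at).items()
            if step not in completed and now > due]
```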

C017 MEDIUM OBSERVED ONCE 3x efficiency

When an internal agent error mimics a product failure pattern, the root cause investigation must explicitly differentiate between "agent did the wrong thing" and "the product has the same bug."

Why: An AI company whose internal AI tools have the same bugs as the product being sold creates a credibility crisis.

Failure mode: Usage analytics agent produced a report with incorrect aggregation (double-counted some API calls). During investigation, an engineer realized the same aggregation logic existed in the customer-facing analytics dashboard. The internal agent bug revealed a product bug affecting 85 customers. The product bug had been shipping incorrect usage reports for 6 weeks. 23 customers had been overbilled by a combined $3,200. Refunds and apology emails took a full week.

C018 MEDIUM OBSERVED ONCE 3x efficiency

Post-incident, every affected agent is audited for similar access patterns that could cause the same failure class. Fix the pattern, not just the instance.

Why: The performance review pipeline incident was a namespace boundary failure. Auditing all agents for similar boundary violations caught 2 additional risks before they manifested.

Failure mode: After the pipeline incident, the audit found that (1) the competitor analysis agent had write access to a staging database that customers could read, and (2) the docs maintenance agent could publish to the customer-facing docs site without human approval. Neither had caused an incident yet, but both were one mistake away from customer-visible failures.
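
A sketch of auditing for the pattern rather than the instance, assuming a hypothetical inventory of agent access grants; the field names and example entries are illustrative:

```python
from dataclasses import dataclass

@dataclass
class AgentAccess:
    agent: str
    target: str             # resource the agent can write to or publish to
    customer_visible: bool  # can customers read this resource?
    human_approval: bool    # is a human in the loop before writes land?

def audit_boundary_risks(grants: list[AgentAccess]) -> list[AgentAccess]:
    """Flag every grant where an agent can write to a customer-visible surface
    without human approval -- the same failure class as the pipeline incident."""
    return [g for g in grants if g.customer_visible and not g.human_approval]

# Illustrative entries mirroring the audit findings described above.
example = [
    AgentAccess("competitor-analysis", "staging-db (customer-readable)", True, False),
    AgentAccess("docs-maintenance", "customer docs site", True, False),
    AgentAccess("usage-analytics", "internal warehouse", False, False),
]
print([g.agent for g in audit_boundary_risks(example)])
```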

C014 HIGH OBSERVED REPEATEDLY 7x efficiency

Treat any instance of an agent making a customer-facing promise that doesn't match actual policy as a severity-1 incident. Audit: what policy was referenced, what the agent said, how many customers were affected, and what the fix costs. Update the policy file and the agent's constraints within 24 hours.

Why: False promises compound. One customer tells another. Screenshots circulate on social media. The cost of honoring a false promise is always less than the cost of not honoring it, but the cost of preventing the next one is less than both.

Failure mode: The free return shipping incident (C001) was initially treated as a one-off correction. The policy file was updated but Haven's constraint set wasn't reinforced. Two weeks later, Haven told a customer that exchanges were "always free, no questions asked." Actual policy: one free exchange per order, second exchange has a $7.95 restocking fee. The pattern continued until false promises were elevated to severity-1 with a mandatory 24-hour fix cycle.
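
A minimal sketch of the severity-1 audit record, using the fields named in the claim; the field names and types are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FalsePromiseIncident:
    """Audit fields from the claim above; naming is illustrative."""
    policy_referenced: str
    agent_statement: str
    customers_affected: int
    fix_cost_usd: float
    opened_at: datetime

    @property
    def fix_deadline(self) -> datetime:
        # Policy file and agent constraints must both be updated within 24 hours.
        return self.opened_at + timedelta(hours=24)
```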

C015 MEDIUM OBSERVED ONCE 3x efficiency

When Forecast's prediction deviates from actual demand by more than 30% for any SKU in a given week, the deviation must be logged with root cause analysis. Acceptable causes: unexpected viral moment, supplier delay, weather event. Unacceptable: "the model was wrong" without further investigation.

Why: Forecasting errors that aren't understood repeat. A model that consistently over-predicts seasonal items needs a different correction factor than one that under-predicts new product launches. Without root cause tracking, the same errors recur.

Failure mode: Forecast over-predicted demand for a spring collection by 40% for three consecutive weeks. Each week, the error was noted but not investigated. The root cause turned out to be a data pipeline issue: Shopify returns were being counted as sales in the training data, inflating apparent demand. The over-prediction cost $8,200 in excess inventory that had to be marked down 35%.
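
A sketch of the deviation check, assuming weekly predicted and actual unit counts per SKU; the acceptable-cause labels come from the claim, the data structures are illustrative:

```python
DEVIATION_THRESHOLD = 0.30  # 30% per the claim above
ACCEPTABLE_CAUSES = {"viral_moment", "supplier_delay", "weather_event"}

def log_forecast_deviations(predicted: dict[str, float],
                            actual: dict[str, float]) -> list[dict]:
    """Return the SKUs whose weekly deviation from actual demand exceeds the threshold.

    Each flagged entry must later be annotated with a root cause;
    "the model was wrong" alone is not acceptable.
    """
    flagged = []
    for sku, pred in predicted.items():
        act = actual.get(sku)
        if not act:
            continue  # no sales recorded; investigate separately to avoid divide-by-zero
        deviation = abs(pred - act) / act
        if deviation > DEVIATION_THRESHOLD:
            flagged.append({"sku": sku, "predicted": pred, "actual": act,
                            "deviation": round(deviation, 2), "root_cause": None})
    return flagged

print(log_forecast_deviations({"dress-spring-01": 140}, {"dress-spring-01": 100}))
```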

C013 HIGH OBSERVED REPEATEDLY 7x efficiency

Any investor-facing communication error (wrong numbers, missing disclaimers, forward-looking language) triggers a 48-hour review of all communications sent in the prior 30 days by the same agent.

Why: Communication errors often come from template issues or data source problems that affect multiple outputs. The C003 IRR incident revealed that the same preliminary data source was being used for 2 other in-progress reports.

Failure mode:

C014 MEDIUM OBSERVED ONCE 3x efficiency

When an LP or their attorney flags a compliance concern, the flag is treated as a P1 incident. Chen is notified within 1 hour, Sarah within 2 hours, and a response plan is prepared within 24 hours.

Why: Investor compliance concerns left unanswered escalate quickly. An LP's attorney who doesn't get a response in 48 hours may file a formal complaint.

Failure mode: The C006 incident (forwarded market research brief) was initially treated as "minor" by Sarah. Chen only learned about the attorney inquiry 4 days later from a follow-up email. By then, the attorney had sent a second, more formal request. Chen now receives all attorney communications in real-time.

C015 MEDIUM OBSERVED ONCE 3x efficiency

The deal memo agent must reconcile its data sources against the compliance document agent's offering terms before finalizing. Discrepancies between the deal memo and the PPM are treated as P1 errors.

Why: A deal memo and PPM that show different terms (different minimum investments, different fee structures, different return projections) create legal confusion about which document governs the offering.

Failure mode: See C005. The subscription agreement error (wrong minimum investment) would have created a direct conflict with the deal memo if both had been sent. The reconciliation step now catches these before distribution.
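
A sketch of the reconciliation step, assuming the deal memo and PPM terms have already been extracted into flat dictionaries; the compared fields come from the claim, the structures are hypothetical:

```python
RECONCILED_FIELDS = ["minimum_investment", "fee_structure", "return_projection"]

def reconcile_deal_memo(deal_memo: dict, ppm: dict) -> list[str]:
    """Return P1 discrepancies; an empty list means the memo can be finalized."""
    errors = []
    for field in RECONCILED_FIELDS:
        if deal_memo.get(field) != ppm.get(field):
            errors.append(
                f"{field}: deal memo says {deal_memo.get(field)!r}, "
                f"PPM says {ppm.get(field)!r}"
            )
    return errors

print(reconcile_deal_memo(
    {"minimum_investment": 50_000, "fee_structure": "2/20", "return_projection": "14% IRR"},
    {"minimum_investment": 100_000, "fee_structure": "2/20", "return_projection": "14% IRR"},
))
```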

Vetted Goods silver
C015 HIGH OBSERVED REPEATEDLY 7x efficiency

Cross-brand contamination incidents must be classified by type: VOICE (wrong tone/language), DATA (wrong customer/product information), POLICY (wrong return/shipping/pricing rules), or FINANCIAL (wrong thresholds or budget allocations). Each type has a different root cause and a different fix.

Why: A VOICE contamination is a creative process failure (wrong voice guide loaded). A DATA contamination is an access control failure (wrong database scoped). Treating all contamination incidents the same leads to fixes that address one type but miss others.

Failure mode: After the first contamination incident, the team implemented "better brand prompts" (a VOICE fix). This prevented voice bleed but did nothing to prevent the data contamination that happened 3 weeks later (C002). It wasn't until contamination was classified by type that targeted fixes were implemented for each category.
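
A sketch of the classification, mapping each contamination type to the fix class it implies; the type definitions come from the claim, and the fix descriptions are paraphrases for illustration:

```python
from enum import Enum

class ContaminationType(Enum):
    VOICE = "wrong tone or language"
    DATA = "wrong customer or product information"
    POLICY = "wrong return, shipping, or pricing rules"
    FINANCIAL = "wrong thresholds or budget allocations"

# Each type points at a different class of fix, per the claim above.
FIX_FOR_TYPE = {
    ContaminationType.VOICE: "creative process fix: load the correct brand voice guide",
    ContaminationType.DATA: "access control fix: scope the agent to the correct brand database",
    ContaminationType.POLICY: "configuration fix: load brand-specific policy files",
    ContaminationType.FINANCIAL: "configuration fix: brand-scoped thresholds and budgets",
}

incident_type = ContaminationType.DATA
print(FIX_FOR_TYPE[incident_type])
```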

C016 HIGH OBSERVED REPEATEDLY 7x efficiency

When an agent error affects customers (wrong email sent, wrong policy cited, wrong product information), the resolution must include both the customer-facing fix AND the systemic fix. Fixing the customer without fixing the system guarantees a repeat.

Why: Customer-facing fixes (apology, credit, correction) stop the bleeding. Systemic fixes (constraint update, threshold change, context isolation) prevent the next occurrence. Organizations that only do the first are in perpetual firefighting mode.

Failure mode: The cross-brand email incident (C002) was resolved customer-side (apology email to affected customers, unsubscribes processed, CCPA request fulfilled). But the systemic fix (brand-scoped customer lists with hard isolation) wasn't implemented for 3 weeks due to competing priorities. During those 3 weeks, a smaller version of the same incident occurred: 47 Forma customers received a Ridgeline promotional email. Same root cause, same failure, smaller scale.