This section documents things that go wrong and how to prevent them. Failure pattern claims are among the most valuable in any OOS because they encode lessons learned the hard way. Other organizations can learn from these without experiencing the failures themselves.
AI agents must never autonomously communicate with clients, sellers, or external stakeholders. All external communication flows through a named human.
Why: A test deployment of the Marketplace Analyst accidentally sent a seller health alert directly to a marketplace operator's Slack channel (misconfigured webhook). The client saw raw internal scoring data including "churn risk: high" for three of their top sellers. The account team spent two weeks in damage control.
Failure mode: Client trust erosion, internal data exposure, potential contract violation. The incident led to our blanket rule: AI agents have zero external communication authority.
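The zero-external-communication rule can be made structural rather than procedural: give agents no send path at all, so the only outbound primitive queues a draft for the named human. A minimal sketch; `HumanReviewQueue`, `HUMAN_OWNER`, and the field names are illustrative assumptions, not a real API.

```python
# Hypothetical sketch: agents have no delivery mechanism; every outbound
# message is queued for a named human owner who alone can send it.
from dataclasses import dataclass, field

HUMAN_OWNER = "account-lead"  # illustrative: the named human in the flow

@dataclass
class HumanReviewQueue:
    pending: list = field(default_factory=list)

    def enqueue(self, agent: str, recipient: str, body: str) -> str:
        # Agents can only hand a draft to the owner; delivery is human-only.
        self.pending.append({"agent": agent, "owner": HUMAN_OWNER,
                             "recipient": recipient, "body": body})
        return f"queued for {HUMAN_OWNER}"

queue = HumanReviewQueue()
status = queue.enqueue("marketplace-analyst", "seller-slack",
                       "seller health alert")
```

Under this design a misconfigured webhook cannot reach a client channel, because no agent code path terminates in an external send.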
Platform-specific AI models must be retrained or re-validated within 30 days of any major platform release (Adobe Commerce, commercetools, VTEX, Mirakl).
Why: Our Code Review Assistant was trained on Adobe Commerce 2.4.5 patterns. When 2.4.6 shipped with breaking changes to the checkout API, the assistant continued approving code written against the old patterns. Three PRs shipped to staging with deprecated method calls before a senior dev flagged it.
Failure mode: Stale AI models approve code against deprecated platform APIs, introducing technical debt and potential runtime failures in client environments.
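The 30-day rule is easy to enforce mechanically. A minimal staleness check, assuming release and validation dates are tracked somewhere; the function and status names are illustrative, not an existing tool.

```python
# Sketch of the 30-day re-validation rule: "valid" if re-validated on or
# after the platform release, "grace" inside the window, "stale" past it.
from datetime import date, timedelta

REVALIDATION_WINDOW = timedelta(days=30)

def validation_status(release: date, last_validated: date, today: date) -> str:
    if last_validated >= release:
        return "valid"   # model was re-validated against this release
    if today - release <= REVALIDATION_WINDOW:
        return "grace"   # not yet re-validated, but within 30 days
    return "stale"       # block the model from approving code
```

A "stale" result should hard-block the model from the review pipeline, not merely warn, since the failure above happened despite humans being nominally in the loop.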
When two AI agents produce conflicting recommendations (e.g., Sales Agent scores a lead as high-priority while the Proposal Engine flags scope concerns), the conflict must surface to a human decision-maker within 4 hours. Neither agent may override the other.
Why: Early in our rollout, the Sales Agent pushed a lead through as "high-fit" while the Proposal Engine flagged the engagement as requiring capabilities we had never delivered (custom blockchain-based marketplace settlement). The agents operated in parallel without conflict detection. The SA discovered the mismatch only after spending a day on a proposal we should have declined.
Failure mode: Wasted senior consultant time, potential over-commitment to engagements outside our capability, and reputational risk if we win work we cannot deliver.
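The no-override rule can be encoded so a conflict has exactly one possible output: a human escalation with a deadline. A sketch under assumed names (`reconcile`, `ESCALATION_SLA`); the real agents' interfaces may differ.

```python
# Illustrative conflict gate: when the Sales Agent says "high-fit" but the
# Proposal Engine raises flags, neither wins; a human gets it within 4 hours.
from datetime import datetime, timedelta

ESCALATION_SLA = timedelta(hours=4)

def reconcile(sales_score: str, proposal_flags: list, now: datetime) -> dict:
    if sales_score == "high-fit" and proposal_flags:
        return {"decision": "escalate-to-human",
                "reasons": proposal_flags,
                "respond_by": now + ESCALATION_SLA}
    # No conflict: pass the score through unchanged.
    return {"decision": sales_score, "reasons": [], "respond_by": None}
```

The key design choice is that the conflict branch returns a decision type the downstream pipeline cannot treat as an approval.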
All three agents activated from day one. Only Protocol Steward had meaningful work. Others generated noise.
Why: Agents without data produce low-value output.
Failure mode: Founder reads noise. Loses trust. Stops reading agent outputs.
Daily agent review consumed build time. Weekly batching loses nothing.
Why: Daily reviews felt productive but were not.
Failure mode: 20-35% of OTP time spent on review instead of building.
Designed 14-agent architecture before shipping code. Only 3 needed now. Planning addiction.
Why: Designing agents is enjoyable. Building platform is hard.
Failure mode: 170 vault files. Zero production code.
We gave the performance analyst write access to campaign settings. It optimized for metrics the client did not care about.
Why: The analyst lacked client context. Its optimization targets were technically correct but strategically wrong.
Failure mode: The analyst decreased ad spend on a campaign the client considered strategic (brand building, not performance). The client was frustrated by the unrequested change.
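One way to enforce the lesson: the analyst gets read access plus a recommendation channel, and writes require a human actor. `CampaignAPI` is a hypothetical wrapper for illustration, not a real ad-platform client.

```python
# Sketch: agents read and recommend; only humans may change campaign settings.
class CampaignAPI:
    def __init__(self):
        self._spend = {"brand-awareness": 10_000}  # illustrative data
        self.recommendations = []

    def read_spend(self, campaign: str) -> int:
        return self._spend[campaign]

    def recommend(self, campaign: str, value: int, rationale: str) -> None:
        # Agents propose; a human decides whether the campaign is strategic.
        self.recommendations.append((campaign, value, rationale))

    def set_spend(self, campaign: str, value: int, actor: str) -> None:
        if actor.startswith("agent:"):
            raise PermissionError("agents are read-only on campaign settings")
        self._spend[campaign] = value
```

The recommendation channel preserves the analyst's value (it was technically correct) while keeping strategic judgment with the client-facing human.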
We used a single shared state file for all agents. It became a bottleneck and a source of merge conflicts within the first week, and caused 12 corruption events in the first month.
Why: A single file means every agent update blocks every other agent. Concurrent writes caused data corruption.
Failure mode: Two agents wrote to the shared file simultaneously. One update was lost. State became inconsistent. Required manual cleanup.
We built coordination infrastructure (a message bus and task queue) without embedding triggers in agent workflows. Result: zero transactions for 2 weeks despite live infrastructure. The bus activated only after we explicitly wired 3 agent workflows to read from and write to their inboxes.
Why: The protocol described how agents should communicate. No agent's workflow actually included a step to read or write to the message bus. Infrastructure without workflow integration is dead plumbing.
Failure mode: 13 inbox files empty for 14 days. Zero transactions. Triggers must be in execution path, not just spec.
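The fix, sketched: make the inbox read a hard-coded first step of every workflow run, so the bus cannot sit unused. `MessageBus` and `run_workflow` are illustrative names, not the actual infrastructure.

```python
# Sketch: draining the inbox is step zero of every run, so the coordination
# spec lives in the execution path rather than beside it.
from collections import deque

class MessageBus:
    def __init__(self, agents):
        self.inboxes = {a: deque() for a in agents}

    def send(self, to: str, msg: str) -> None:
        self.inboxes[to].append(msg)

def run_workflow(agent: str, bus: MessageBus, steps) -> list:
    # Mandatory first step: read the inbox before doing anything else.
    received = list(bus.inboxes[agent])
    bus.inboxes[agent].clear()
    for step in steps:
        step(agent, bus, received)  # steps can reply via bus.send(...)
    return received
```

Because `run_workflow` itself drains the inbox, an agent cannot run without participating in the bus, which is exactly the property the original deployment lacked.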
We specified an escalation action in the protocol. The agent detected the trigger. The agent reported the action was overdue. The agent never executed the action. For 17 days.
Why: Spec treated as documentation, not executable logic. Agent could describe what should happen without tools or branching logic to do it.
Failure mode: Critical ad overspend detected. Escalation specified. Agent reports "escalation overdue" for 17 days. No DM sent. No escalation executed. Specification-execution gap.
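Closing the specification-execution gap means the detector calls the action directly instead of reporting it as overdue. A sketch; `check_overspend` and the `notify` callback are hypothetical names.

```python
# Sketch: detection and escalation share one code path, so the agent cannot
# observe the trigger without also firing the action.
def check_overspend(spend: float, budget: float, notify) -> str:
    if spend > budget:
        notify(f"overspend: {spend:.2f} > {budget:.2f}")  # the escalation runs
        return "escalated"
    return "ok"
```

The point is structural: there is no state in which the trigger has fired but the escalation has not, so "escalation overdue for 17 days" becomes unrepresentable.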
Negative constraints (banned phrases, guardrails) improve AI-drafted message quality. Structural requirements (frameworks, examples, forced elements) degrade it.
Why: Telling an AI what NOT to do produces natural variation. Telling it exactly what TO do produces formulaic output that humans detect and distrust.
Failure mode: Added example messages to coaching prompts. Quality score dropped from 8.4 to 8.2. Reverted. Added zero-tolerance accountability rules instead. Score rose to 8.8.
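The pattern can be expressed as a prompt builder whose constraints are all negative (banned phrases), with no templates or required structure. The phrase list and function names are illustrative assumptions.

```python
# Sketch: negative-only constraints in the prompt, plus a matching checker
# that rejects drafts containing any banned phrase.
BANNED_PHRASES = ["I hope this finds you well", "circle back", "touch base"]

def build_prompt(task: str) -> str:
    rules = "\n".join(f"- Never use the phrase: '{p}'" for p in BANNED_PHRASES)
    return f"{task}\n\nConstraints:\n{rules}"

def violates(draft: str) -> bool:
    # Enforcement mirrors the prompt: check what must NOT appear.
    return any(p.lower() in draft.lower() for p in BANNED_PHRASES)
```

Note what the builder deliberately omits: no example messages, no required frameworks, no forced elements, consistent with the quality-score finding above.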