Coordination Intelligence

operational heuristics

90 claims from 32 organizations

Rules of thumb learned from practice. These claims may not be provable in all cases, but they have been observed to work reliably. Heuristics evolve into rules as evidence accumulates.

Acme Digital Agency Founding gold
C010 MEDIUM OBSERVED ONCE 3x efficiency

Unresolvable errors: log and stop. No automatic retry.

Why: Retries on unresolvable errors create noise.

Failure mode: Agent retries 100 times, consumes rate limits.
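
A minimal sketch of the log-and-stop rule in Python; `UnresolvableError` and the `task.execute()`/`task.id` interface are hypothetical stand-ins, and only the retry policy is the point:

```python
import logging

logger = logging.getLogger("agent")

class UnresolvableError(Exception):
    """Errors that retrying cannot fix: bad credentials, 4xx responses, schema mismatches."""

def run_task(task, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            return task.execute()
        except UnresolvableError as exc:
            # Log once with context and stop -- retries would only burn rate limits
            logger.error("unresolvable error on task %s: %s; halting", task.id, exc)
            raise
        except Exception as exc:
            # Only transient failures (timeouts, rate limits) are worth retrying
            logger.warning("transient error on task %s (attempt %d/%d): %s",
                           task.id, attempt, max_retries, exc)
    logger.error("task %s exhausted %d retries; halting", task.id, max_retries)
```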

C008 HIGH MEASURED RESULT 10x efficiency

GPT creative output undergoes a mandatory 2-hour hold before entering the review queue. No same-session generation and approval.

Why: Account managers reviewing GPT output immediately after generation approved it at a 94% rate. When we added a 2-hour hold (the AM reviews the batch later in the day or the next morning), the approval rate dropped to 71%. The 23-point gap was copy that "sounded good in the moment" but had issues visible with fresh eyes -- subtle tone mismatches, claims that were technically true but misleading, and formatting that didn't match the client's brand voice.

Failure mode: Same-session review creates familiarity bias. The reviewer just saw the brief, the context is loaded, and the output feels like a natural continuation. Distance improves judgment.

C009 HIGH MEASURED RESULT 10x efficiency

Track and report the cross-model error rate separately from single-model error rates. Any task that involves both Claude and GPT is measured as a distinct category.

Why: Our overall agent error rate was 3.2%. When we segmented, single-model tasks (Claude-only or GPT-only) had a 1.8% error rate. Cross-model tasks had an 8.7% error rate -- nearly 5x. The errors were concentrated at handoff points: incomplete context transfer, schema mismatches, and misinterpretation of structured fields. Without segmenting, the 3.2% blended rate masked a systemic handoff problem.

Failure mode: Blended error rates hide that cross-model handoffs are the primary failure point. Resources are allocated to improving individual agents when the real problem is the integration layer.
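
A sketch of the segmentation, assuming each task record carries the set of models involved and a failure flag (field names are illustrative):

```python
from collections import defaultdict

def error_rates(task_log):
    """Bucket cross-model tasks separately so handoff failures are not blended away.
    task_log: iterable of dicts with 'models' (set of model names) and 'failed' (bool)."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [failures, total]
    for task in task_log:
        bucket = "cross-model" if len(task["models"]) > 1 else "single-model"
        counts[bucket][0] += int(task["failed"])
        counts[bucket][1] += 1
    return {bucket: fails / total for bucket, (fails, total) in counts.items()}
```

On the numbers above, the blended 3.2% rate would decompose into roughly 1.8% single-model and 8.7% cross-model.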

C016 LOW OBSERVED ONCE 1.5x efficiency

Every agent must include its model name and version in the metadata of its shared state file output.

Why: When debugging an analysis discrepancy, we couldn't tell which model version produced a particular output. The ad monitor had been running on Claude 3 Opus while the pacing agent had been upgraded to Claude 3.5 Sonnet. Their outputs used different rounding conventions, making numbers mismatch by $1-3 per metric. Model version in metadata would have identified the discrepancy source in minutes instead of the 2 hours it took.

Failure mode: Without model version tracking, debugging cross-agent discrepancies requires testing each agent individually. Root cause identification is slow when version differences aren't visible.
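
One way to enforce the rule, sketched as a wrapper that stamps model identity into every shared-state write; the file layout is an assumption, not the org's actual schema:

```python
import json
from datetime import datetime, timezone

def write_shared_state(path, payload, model_name, model_version):
    """Stamp the producing model into metadata so cross-agent discrepancies
    (e.g. rounding differences between model versions) are traceable in minutes."""
    record = {
        "metadata": {
            "model_name": model_name,        # e.g. "claude-3-5-sonnet"
            "model_version": model_version,  # the exact API version string
            "written_at": datetime.now(timezone.utc).isoformat(),
        },
        "data": payload,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
```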

C010 HIGH OBSERVED REPEATEDLY 7x efficiency

Ad performance is evaluated at the location level with a minimum 14-day window. No optimization decisions are made on less than 14 days of data per location.

Why: Small-market locations (Tampa, Phoenix suburbs) have low daily lead volume. Day-to-day variance is enormous. A "bad day" is meaningless; a bad 2-week trend is actionable.

Failure mode: The ads agent paused a Google Ads campaign for the Tampa location after 5 days of zero leads. The campaign had averaged 2.1 leads/day over the prior month. The 5-day drought was within normal variance. Restarting the campaign lost 3 days of learning and reset the algorithm.

C011 MEDIUM OBSERVED ONCE 3x efficiency

Lead quality scoring weights location-specific conversion history over network averages. A "hot" lead in Chicago (where average ticket is $149/mo) has different characteristics than a "hot" lead in Tampa (where average ticket is $89/mo).

Why: Network-wide lead scoring models produce false positives in lower-ticket markets and false negatives in higher-ticket markets.

Failure mode: The lead distribution agent prioritized Tampa leads using the Chicago-trained scoring model. It flagged members interested in premium personal training as "hot." Tampa doesn't offer premium PT. Staff called 30 "hot" leads pitching a service that didn't exist at their location.

C012 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Seasonal patterns are tracked per-location, not network-wide. January surges vary by 40-60% across locations. Summer dips range from 10% to 35% depending on climate and demographics.

Why: Using network-average seasonality for budget planning over- or under-allocates at the extremes.

Failure mode: The ads agent applied a uniform 30% January budget increase across all locations. Chicago needed 50% (cold weather drives indoor fitness demand). Tampa needed only 15% (year-round outdoor fitness options). Tampa overspent by $1,200 in January. Chicago underspent and missed 40+ leads.

C009 MEDIUM OBSERVED REPEATEDLY 4x efficiency

When a client requests more than 3 rounds of revisions on a single deliverable, flag it for Marcus as a potential scope issue before logging revision 4.

Why: Unlimited revisions is the silent killer of creative agency margins. The flag forces a conversation about whether to charge for additional rounds.

Failure mode: One project went through 7 revision rounds without anyone noticing the pattern. The client was happy but the project margin was -12%. Marcus didn't realize until quarterly review.

C010 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Shot list suggestions must include at least 2 reference links from the client's existing brand content or stated references. Never suggest shots without grounding in the client's visual language.

Why: Generic shot suggestions feel like templates. Grounded suggestions feel like someone studied the brand.

Failure mode: Agent suggested a "drone reveal shot" for a brand that exclusively uses intimate, handheld footage. Designer flagged it as "clearly from a machine that doesn't understand the brand." Technically correct suggestion, totally wrong for the client.

C011 MEDIUM OBSERVED ONCE 3x efficiency

Invoice reminders to clients use the same casual tone Marcus uses in his own emails. No formal language, no "please remit payment."

Why: Artifact's brand is approachable and creative. Formal invoice language feels like it's coming from a different company.

Failure mode: First automated invoice reminder used "Please find attached your outstanding invoice for services rendered." Client replied to Marcus: "Did you hire an accountant? Lol." Minor, but it broke the illusion of a small, personal shop.

Atticus Legal bronze
C010 MEDIUM MEASURED RESULT 6x efficiency

Document assembly for a standard estate plan (trust, will, POA, advance directive) takes the agent 12 minutes. Priya's review takes 45 minutes per client package. Beth's formatting and preparation for signing takes 30 minutes. Total pipeline: 87 minutes per client versus the pre-agent baseline of 3.5 hours.

Why: Knowing the pipeline timing lets Priya schedule accurately. Before agents, she routinely underestimated document prep time and fell behind. Now she schedules 90-minute blocks per client and consistently hits the mark.

Failure mode: Without accurate pipeline timing, Priya overschedules. Four signing appointments in one day when she can only prepare for three. Last client's documents rushed. Error rate increases when Priya is behind schedule.

C011 MEDIUM MEASURED RESULT 6x efficiency

The scheduling agent sends three follow-up touches for annual trust reviews: 60 days before anniversary, 30 days, and 7 days. After the third touch with no response, it flags the client as DORMANT and stops. No more than 3 touches per year.

Why: Too much follow-up annoys clients and feels desperate. Too little loses annual review revenue ($750-$1,200 per review). Three touches at 60/30/7 days produced a 62% review booking rate, up from 38% when Beth was sending manual reminders on an ad hoc schedule.

Failure mode: Without a cap, the agent sends monthly reminders. Client perceives the firm as aggressive. Leaves a negative review mentioning "constant harassment." Priya loses a client and a referral source. In a solo practice, every client lost is felt in the revenue.
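
The 60/30/7 cadence and the dormancy cap reduce to a few lines; the function names are illustrative:

```python
from datetime import date, timedelta

TOUCH_OFFSETS_DAYS = (60, 30, 7)
MAX_TOUCHES_PER_YEAR = 3

def touch_dates(anniversary: date) -> list[date]:
    """Scheduled reminder dates ahead of the annual trust review anniversary."""
    return [anniversary - timedelta(days=d) for d in TOUCH_OFFSETS_DAYS]

def next_action(touches_sent: int, client_replied: bool) -> str:
    if client_replied:
        return "BOOK_REVIEW"
    if touches_sent >= MAX_TOUCHES_PER_YEAR:
        return "MARK_DORMANT"  # stop: no further outreach this year
    return "SEND_TOUCH"
```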

C012 LOW OBSERVED ONCE 1.5x efficiency

The assembly agent generates documents in the firm's standard formatting: 12pt Times New Roman, 1-inch margins, numbered paragraphs, firm letterhead. It never uses alternate fonts, creative layouts, or formatting that deviates from the template.

Why: Estate planning documents are read by courts, trustees, financial institutions, and opposing counsel. Non-standard formatting signals carelessness. A bank once questioned a trust because the formatting looked "different from what we usually see" and requested a letter of opinion confirming its validity. That letter cost Priya 2 hours.

Failure mode: Agent assembles a trust with a different font because the template's font metadata was corrupted. Bank receiving the trust as part of an account titling process flags it as potentially invalid. Client calls Priya. Priya spends 2 hours writing a letter of opinion. $600 in non-billable time because of a font.

C011 HIGH OBSERVED REPEATEDLY 7x efficiency

Progress reports are generated biweekly, not weekly. Weekly reports create parent anxiety without providing meaningful new information.

Why: Tutoring progress is nonlinear. A bad week followed by a good week looks like a crisis and a recovery in weekly reports. Biweekly smooths the noise.

Failure mode: Weekly reports caused 3 parents to request "emergency conferences" in a single month because their child had one below-average session. Keisha spent 6 hours in unnecessary meetings. Moving to biweekly reduced parent escalations by 80%.

C012 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Tutor session notes must be submitted within 24 hours of the session. The scheduling agent flags missing notes at the 24-hour mark.

Why: Tutors forget details after 24 hours. Late notes are less accurate, which contaminates progress reports.

Failure mode: One tutor submitted 3 weeks of notes in a single batch. The notes were vague ("worked on math") and unusable for progress reports. Keisha had to contact 8 families to apologize for the delayed report. She now pays tutors a $5 bonus for same-day notes.

C013 HIGH OBSERVED REPEATEDLY 7x efficiency

When a student misses 2 consecutive sessions without parent communication, the parent communication agent drafts a check-in message for Keisha.

Why: Missed sessions without communication often signal a family considering leaving. Early outreach retains 60% of at-risk families.

Failure mode: Before this rule, a family missed 4 sessions over a month. Keisha assumed they were on vacation. They had actually switched to a competitor. She found out when the mother mentioned it casually at a school event. $4,800/year lost with zero warning.

Candor Labs bronze
C007 HIGH OBSERVED ONCE 5x efficiency

The release notes agent generates from merged PR titles and descriptions only. It does not read code diffs to infer what changed.

Why: The release notes agent read a diff that renamed an internal function and generated a changelog entry: "Breaking: API endpoint /auth/refresh renamed." The function rename was internal only -- the API contract hadn't changed. A user on the changelog RSS feed opened a GitHub issue asking about the "breaking change" and whether they needed to update their integration. The founder spent 30 minutes clarifying that nothing had changed for users.

Failure mode: Agents infer user-facing changes from internal code diffs. Internal refactors are misrepresented as breaking changes. Users react to non-existent breaking changes.

C008 HIGH MEASURED RESULT 10x efficiency

The support triage agent checks for duplicate issues before categorizing. Duplicates are linked, not re-triaged independently.

Why: The same bug was reported 4 times across GitHub issues and Slack over a weekend. The triage agent created 4 separate P1 entries. The founder's Monday morning check showed 4 P1 issues and he panicked, thinking 4 different critical bugs had emerged. It was 1 bug reported 4 ways. After implementing deduplication (matching by error message, stack trace similarity, and affected endpoint), false P1 volume dropped 35% over the next month.

Failure mode: Duplicate reports are triaged independently, inflating priority counts. The single reviewer overestimates severity based on volume. Panic replaces triage.

C014 MEDIUM MEASURED RESULT 6x efficiency

The code review agent includes a "confidence" indicator on each finding: HIGH (definite bug or security issue), MEDIUM (likely problem, needs human judgment), LOW (style preference, could go either way).

Why: Without confidence labels, the founder treated all code review findings equally. He either fixed everything (30 minutes on style nits) or skipped everything (missed real bugs). Confidence labels let him triage: fix all HIGHs immediately, review MEDIUMs during dedicated review time, batch LOWs for monthly style cleanup. Effective review time dropped from 25 minutes/day to 8 minutes/day with zero increase in bugs reaching production.

Failure mode: Uniform presentation of findings forces binary processing: everything or nothing. Confidence labels enable triage. Without them, the solo founder's limited review time is misallocated.

C011 MEDIUM INFERENCE 2x efficiency

Proposal drafts from Archer should be 60-70% complete, not 95%. Leave strategic positioning, pricing rationale, and the executive summary for the consultant to write.

Why: Over-polished agent drafts create a false sense of completion. Consultants rubber-stamp instead of thinking critically. The best proposals have the consultant's genuine strategic voice in the sections that matter most.

Failure mode: Archer produced a near-perfect 28-page proposal. The partner skimmed it, approved it, and sent it. The pricing section included a 15% discount that Archer inferred from a prior similar engagement but that was not appropriate for this client's scope. Cost us $31,500 in margin.

C012 MEDIUM OBSERVED ONCE 3x efficiency

Vault's knowledge base must be refreshed quarterly. Methodologies, case studies, and templates older than 18 months must be flagged for review and either updated or archived.

Why: Stale methodologies in proposals and deliverables make the firm look dated. Clients in fast-moving industries notice when frameworks reference pre-pandemic market conditions.

Failure mode: Archer pulled a "Digital Transformation Readiness Assessment" template from Vault that referenced 2023 technology benchmarks. The client's CTO pointed out the benchmarks were 3 years old during the proposal review call.

C013 HIGH INFERENCE 3x efficiency

For engagements involving direct competitors, assign different consultant teams and ensure no agent holds context for both engagements simultaneously. Rotate agent context between engagements, never run them in parallel.

Why: Even with firewalls, simultaneous context is the highest-risk vector for information leakage. Sequential processing with context clearing is safer than parallel processing with access controls.

Failure mode: This is the architectural response to the Haldane/Orion incident (C001). No second incident has occurred since implementing sequential processing.

C007 HIGH OBSERVED REPEATEDLY 7x efficiency

Lead nurture sequences pause automatically when a prospect books a trial class. Resume only if they no-show or don't convert within 7 days.

Why: Continuing to nurture someone who already booked feels tone-deaf and spammy.

Failure mode: A prospect booked a trial, received 3 more "Book your free class!" emails before attending. Replied "I already did, is anyone actually reading these?" Trial converted but trust was damaged from the start.

C008 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Trainer performance metrics use a 4-week rolling average, not single-week snapshots. Seasonal patterns (New Year surge, summer dip) are normalized against the same period last year.

Why: Single-week data is noisy. A trainer with one bad week due to illness shouldn't be flagged. Seasonal patterns create false positives.

Failure mode: January metrics showed every trainer "improving" dramatically. It was just the New Year resolution surge. Jamie almost gave bonuses based on phantom performance gains.

C009 HIGH OBSERVED REPEATEDLY 7x efficiency

Location-specific context must be attached to every agent action. No agent operates in a "generic CoreFit" mode. Each location has different peak hours, demographics, and class preferences.

Why: The downtown location skews young professionals (25-35). The suburban location skews parents (35-50). Messaging that works for one alienates the other.

Failure mode: Lead nurture sent "Bring the kids to our Saturday Family Fitness!" to downtown prospects. Downtown has no kids' classes. 4 confused replies.

C015 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Member retention risk scoring uses a weighted model: visit frequency (40%), class variety (20%), social engagement (15%), billing consistency (15%), tenure (10%). No single factor triggers an at-risk flag alone.

Why: Single-factor triggers produce too many false positives. A long-tenured member who drops visit frequency for 2 weeks might be on vacation, not churning.

Failure mode: Early model used visit frequency alone. Flagged 23 members as at-risk in one week. 19 were on a local school spring break vacation. Jamie wasted 4 hours reviewing false flags.
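
The weighted model is straightforward to express. The 0.5 cutoff below is illustrative (the real cutoff isn't stated), but any cutoff above 0.40 preserves the property that no single factor can trigger the flag alone:

```python
WEIGHTS = {
    "visit_frequency":     0.40,
    "class_variety":       0.20,
    "social_engagement":   0.15,
    "billing_consistency": 0.15,
    "tenure":              0.10,
}
AT_RISK_THRESHOLD = 0.5  # illustrative cutoff

def risk_score(member: dict) -> float:
    """member maps each factor to a normalized risk in [0, 1] (1 = highest risk).
    The worst single factor contributes at most its weight (0.40), below threshold."""
    return sum(WEIGHTS[factor] * member[factor] for factor in WEIGHTS)

def at_risk(member: dict) -> bool:
    return risk_score(member) >= AT_RISK_THRESHOLD
```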

DevForge silver
C011 HIGH OBSERVED REPEATEDLY 7x efficiency

Issue triage prioritizes by: (1) enterprise customer reports, (2) security issues, (3) issues with reproduction steps, (4) feature requests with 5+ thumbs-up, (5) everything else.

Why: Enterprise customers pay. Security issues are existential. Reproducible issues get fixed faster. Community-validated features should ship. Everything else can wait.

Failure mode: Before prioritization, issues were triaged by recency. An enterprise customer's critical bug sat at position #14 in the queue behind 13 minor feature requests. The customer escalated via email after 5 days. Kai fixed it in 30 minutes but the delayed response nearly cost the $2,400/year contract.
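
The five-tier ordering works cleanly as a sort key; the field names here are assumptions about the issue tracker's export, not its actual schema:

```python
def triage_key(issue: dict):
    """Lower tuples sort first; recency breaks ties only within a tier."""
    if issue.get("enterprise_reporter"):
        tier = 0
    elif issue.get("security"):
        tier = 1
    elif issue.get("has_repro_steps"):
        tier = 2
    elif issue.get("is_feature_request") and issue.get("thumbs_up", 0) >= 5:
        tier = 3
    else:
        tier = 4
    return (tier, issue["created_at"])

# usage: queue = sorted(issues, key=triage_key)
```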

C012 MEDIUM OBSERVED ONCE 3x efficiency

Release notes are published within 24 hours of a release. If Kai hasn't reviewed the draft within 12 hours, the agent sends a reminder to his private Slack channel.

Why: Enterprise customers monitor releases. A release without notes triggers "what changed?" emails that cost more time than writing the notes.

Failure mode: Kai shipped v2.6.0 on a Friday and forgot to publish release notes. By Monday, 4 enterprise customers had emailed asking what changed. One customer's security team flagged the update as "unreviewed" and blocked their team from upgrading. It took 2 weeks to get through their security review after the late notes were published.

C013 MEDIUM OBSERVED ONCE 3x efficiency

Discord monitoring tracks sentiment, not just questions. A shift from positive to negative sentiment in any channel triggers a summary to Kai's private Slack within 1 hour.

Why: Developer communities turn fast. A frustrating bug or a perceived lack of responsiveness can shift tone from supportive to hostile in a single day.

Failure mode: A breaking change in v2.3.0 caused issues for 15+ users over a weekend. Discord #help went from 2 messages/day to 30 messages/day, all negative. Kai was offline and didn't see it until Monday. By then, the narrative had solidified: "DevForge ships breaking changes without warning." A community member forked the project as a "stable alternative." The fork got 200 stars before Kai could respond.

C010 HIGH OBSERVED REPEATEDLY 7x efficiency

Support ticket priority is determined by financial impact. Tickets mentioning failed payments, incorrect balances, or unauthorized transactions are auto-escalated to P1 regardless of the user's tone or language.

Why: A polite user reporting a $500 balance discrepancy is more urgent than an angry user complaining about the UI. Financial impact trumps sentiment.

Failure mode: The triage agent initially used sentiment analysis for priority. An angry user complaining about a font change was prioritized over a calm user reporting that $1,200 appeared to be missing from their account. The calm user waited 18 hours for a response. The "missing" money was a categorization display bug, but the delay eroded trust.

C011 HIGH OBSERVED REPEATEDLY 7x efficiency

API health checks run every 30 seconds for Plaid and Stripe. Degraded performance (response time >2x baseline) triggers a P2 alert. Full outage (no response for 3 consecutive checks) triggers P1.

Why: Degraded performance often precedes full outages. Early warning gives the engineering team time to activate failover or notify users before the situation becomes critical.

Failure mode: Before the degradation detection, a Stripe partial outage caused payment processing to slow from 200ms to 8 seconds. Users experienced "spinning" payment screens. 14 users abandoned mid-payment. The monitoring agent only alerted when Stripe went fully down 40 minutes later.
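
A sketch of the check loop under the stated thresholds; `check` and `alert` are injected callables (assumptions), and a real deployment would run one loop per provider:

```python
import time
from collections import deque

CHECK_INTERVAL_S = 30
DEGRADED_FACTOR = 2.0   # response time > 2x baseline -> P2
OUTAGE_MISSES = 3       # consecutive no-responses -> P1

def monitor(check, baseline_ms, alert):
    """check() returns the response time in ms, or None when there is no response."""
    recent_misses = deque(maxlen=OUTAGE_MISSES)
    while True:
        rt = check()
        recent_misses.append(rt is None)
        if len(recent_misses) == OUTAGE_MISSES and all(recent_misses):
            alert("P1", "full outage: no response for 3 consecutive checks")
        elif rt is not None and rt > DEGRADED_FACTOR * baseline_ms:
            alert("P2", f"degraded: {rt:.0f}ms vs {baseline_ms:.0f}ms baseline")
        time.sleep(CHECK_INTERVAL_S)
```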

C012 MEDIUM OBSERVED ONCE 3x efficiency

The categorization QA agent flags systematic drift when any category's miscategorization rate exceeds 5% over a 7-day window. Drift below 5% is logged but not alerted.

Why: Individual miscategorizations are normal (merchants change names, new merchants appear). Systematic drift indicates a model problem that affects many users simultaneously.

Failure mode: A merchant data provider changed their taxonomy, causing "Groceries" to be classified as "General Merchandise" for 380 users. The drift wasn't flagged for 12 days because the old threshold was 10%. Users noticed before Greenline did. 7 support tickets in one day about "wrong categories."
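
The alert-versus-log split, sketched with an assumed per-category tally over the trailing 7 days:

```python
DRIFT_THRESHOLD = 0.05  # 5% miscategorization rate over a 7-day window

def check_drift(window: dict, alert, log):
    """window maps category -> (miscategorized, total) over the trailing 7 days.
    Sub-threshold drift is logged for trend analysis but never paged."""
    for category, (bad, total) in window.items():
        if total == 0:
            continue
        rate = bad / total
        if rate > DRIFT_THRESHOLD:
            alert(f"systematic drift in '{category}': {rate:.1%} over 7 days")
        else:
            log(f"'{category}' drift {rate:.1%}: below threshold, logged only")
```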

C011 MEDIUM INFERENCE 2x efficiency

Demand letter drafts use the firm's established template structure and tone. The agent does not experiment with novel legal arguments or creative formatting. If the fact pattern suggests a non-standard approach, the agent flags it and defers to the attorney.

Why: Insurance adjusters read thousands of demand letters. They recognize standard, professionally structured letters as coming from competent firms. A creatively formatted letter or an experimental legal argument can signal inexperience, even if the content is sound.

Failure mode: Agent drafts a demand letter using a narrative style instead of the firm's established format. Adjuster perceives the firm as inexperienced. Initial counteroffer is 40% lower than expected. Attorney spends two additional months negotiating to reach the same number a standard letter would have achieved.

C012 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Client update calls are scheduled between 10 AM and 3 PM on Tuesday through Thursday. Mondays are for internal case review. Fridays are for court appearances and depositions. The comms agent never schedules outside this window without attorney override.

Why: Attorneys need uninterrupted blocks for case preparation and court appearances. Early implementation allowed the comms agent to schedule calls at 8 AM and 4:30 PM. Attorneys were arriving at court unprepared because they had been on a client call until 15 minutes before their appearance.

Failure mode: Comms agent schedules a client call at 8:15 AM. Attorney's deposition starts at 9:00 AM. Call runs long. Attorney arrives at deposition flustered and underprepared. Opposing counsel notices and pushes harder on key points.

C013 MEDIUM OBSERVED ONCE 3x efficiency

The intake agent asks 7 standardized screening questions before classification. If a potential client cannot answer 3 or more questions, the case is classified as INCOMPLETE rather than WEAK. Incomplete cases get a 48-hour follow-up, not a decline.

Why: Trauma patients often cannot recall details during the first call. A car accident victim called 2 days post-accident and could not provide the other driver's insurance info, the police report number, or the exact location. Agent classified the case as WEAK. Attorney overrode it. Case settled for $195K.

Failure mode: Traumatized potential client provides incomplete information. Agent classifies as WEAK or DECLINE. Firm turns away a strong case because the agent prioritized data completeness over human context.
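
The classification gate, sketched in Python; the downstream strength score and its 0.6 cutoff are illustrative stand-ins for whatever merit assessment follows:

```python
SCREENING_QUESTIONS = 7
INCOMPLETE_CUTOFF = 3  # this many unanswered questions (or more) -> INCOMPLETE

def classify_intake(answers: list, strength_score: float) -> str:
    """answers has one entry per screening question, None when unanswered.
    Missing data routes to a 48-hour follow-up, never straight to a decline."""
    assert len(answers) == SCREENING_QUESTIONS
    unanswered = sum(a is None for a in answers)
    if unanswered >= INCOMPLETE_CUTOFF:
        return "INCOMPLETE"  # schedule the 48-hour follow-up call
    return "STRONG" if strength_score >= 0.6 else "WEAK"  # 0.6 is illustrative
```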

C011 MEDIUM MEASURED RESULT 6x efficiency

Listing descriptions are between 150 and 300 words. Under 150 feels thin and suggests the agent did not visit the property. Over 300 gets truncated on Zillow's mobile display. Every description includes: location context (without superlatives), key features (square footage, bedrooms, bathrooms, lot size), notable upgrades, and a neutral call to action.

Why: Analysis of 200 Denver MLS listings showed that descriptions between 150-300 words received 23% more saves than those outside this range. Zillow's mobile truncation at approximately 280 characters for the preview means the first two sentences must contain the most important information.

Failure mode: Description runs to 450 words. Zillow mobile preview shows only the first two sentences, which happen to be generic neighborhood context. Buyer scrolling on their phone never sees the renovated kitchen or the mountain views. Listing gets fewer saves and fewer showing requests.
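
A sketch of the pre-publication check; detecting the required elements is assumed to happen upstream, so this validator only enforces the bounds and the checklist:

```python
import re

REQUIRED_ELEMENTS = ("location_context", "key_features", "upgrades", "call_to_action")

def validate_description(text: str, elements_present: dict) -> list[str]:
    """elements_present maps each required element to True/False (judged upstream)."""
    issues = []
    words = len(re.findall(r"\S+", text))
    if words < 150:
        issues.append(f"too short ({words} words): reads thin")
    elif words > 300:
        issues.append(f"too long ({words} words): truncated on Zillow mobile")
    for element in REQUIRED_ELEMENTS:
        if not elements_present.get(element):
            issues.append(f"missing: {element}")
    return issues
```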

C012 HIGH MEASURED RESULT 10x efficiency

The qualifier responds to new web leads within 15 minutes during business hours (8 AM-7 PM) and within 2 hours outside business hours. Response includes a personalized acknowledgment referencing the specific property or search criteria the lead expressed interest in.

Why: NAR research shows that leads contacted within 5 minutes are 21x more likely to convert than those contacted after 30 minutes. Our 15-minute target balances speed with personalization quality. Before automation, average response time was 4.7 hours. After: 11 minutes during business hours.

Failure mode: Lead submits interest on a Zillow listing at 10 AM. Without automated qualification, the inquiry sits in an agent's email until she checks between appointments at 2 PM. By then, the lead has received responses from three other brokerages. Lead goes with the fastest responder.

C013 LOW OBSERVED REPEATEDLY 2x efficiency

Seller reports are generated every Thursday at 4 PM and delivered by 6 PM. This timing allows sellers to review over the weekend and come to Monday meetings with questions. Reports include week-over-week showing trends, not just raw numbers.

Why: Sellers who receive reports on Monday morning feel blindsided going into the work week. Friday delivery felt end-of-week. Thursday gives sellers 72 hours to process the information before their next conversation with their agent. Seller satisfaction scores improved from 7.2 to 8.4 (out of 10) after the timing change.

Failure mode: Report delivered Monday morning shows a 40% drop in showings. Seller panics, calls agent before the agent has had coffee. Reactive conversation instead of strategic one. Agent spends 45 minutes calming the seller instead of 15 minutes discussing next steps.

KGORG Founding silver
C010 HIGH OBSERVED REPEATEDLY 7x efficiency

Recommendation-first behavior is preferred over question-first behavior when enough context exists to make a strong next-step proposal.

Why: The organization is explicitly designed to reduce user bottleneck load and decision fatigue.

Failure mode: If agents default to interrogation instead of recommendation, the principal becomes the routing and reasoning layer, defeating the purpose of the system.

C011 HIGH OBSERVED REPEATEDLY 7x efficiency

Agents should ask at most one focused clarifying question when a missing detail blocks action, rather than opening broad discovery loops.

Why: The system values execution momentum and low-friction interaction.

Failure mode: Over-questioning slows progress, increases user effort, and creates the sense that AI is adding coordination overhead rather than removing it.

C012 MEDIUM INFERENCE 2x efficiency

No-show prediction models must use only day-of-week, time-of-day, appointment type, weather, and visit number in sequence (first visit, second visit, etc.). Patient demographics, diagnosis, and insurance type must not be used as predictive features even if they improve model accuracy.

Why: Using diagnosis or insurance type to predict no-shows creates discriminatory scheduling practices. If the model learns that Medicaid patients no-show more frequently and double-books those slots, it's implementing economic discrimination in healthcare access. This violates both ethical standards and potentially the Civil Rights Act.

Failure mode: Hypothetical, but the constraint was added proactively after a published case study from another practice showed that using insurance type as a no-show predictor resulted in Medicaid patients being systematically double-booked, reducing their available appointment times. Kinwell's practice manager read the case study and preemptively restricted Flow's feature set.

C013 HIGH OBSERVED ONCE 5x efficiency

Beacon must include accurate clinical information in all patient education content. Every health claim must be sourced from peer-reviewed literature or professional PT association guidelines. Beacon must not generate exercise recommendations, recovery timelines, or treatment expectations without clinical review.

Why: Healthcare marketing content that includes inaccurate clinical information is a liability risk. A blog post that says "most ACL recoveries take 8-12 weeks" when the actual clinical range is 6-9 months creates patient expectations that the practice cannot meet. It also exposes the practice to malpractice claims if a patient cites the content as the basis for their treatment expectations.

Failure mode: Beacon drafted a blog post stating "heel spurs typically resolve within 4-6 weeks of physical therapy." The actual clinical consensus is that plantar fasciitis (the condition causing heel spurs) typically requires 6-12 months of conservative treatment including PT. The clinical reviewer caught it. Had the post been published, patients beginning PT for heel spurs would have expected resolution in 4-6 weeks and been dissatisfied when it took longer.

C011 MEDIUM MEASURED RESULT 6x efficiency

Maintenance requests received between 10 PM and 6 AM are held for triage until 6 AM unless the tenant's message contains explicit emergency language (flooding, gas smell, fire, no heat, sparking). Non-emergency overnight requests are acknowledged immediately ("received, will be reviewed at 8 AM") but not triaged or dispatched until morning.

Why: 68% of after-hours requests in our first 6 weeks were ROUTINE. Triaging them overnight triggered unnecessary overnight vendor dispatch planning. The hold-until-6-AM rule reduced after-hours vendor contacts by 71% without any increase in property damage from delayed responses.

Failure mode: Without the overnight hold, every after-hours request triggers full triage. 68% are routine but still generate 2 AM Slack notifications to Corinne. Corinne sleeps poorly. Decision quality degrades during the day. She misses a rent payment pattern that would have flagged a tenant at risk of default.

C012 LOW OBSERVED ONCE 1.5x efficiency

Tenant communications use a warm but professional tone. No exclamation points. No emojis. No slang. No first-name-only greetings. Every message includes a ticket reference number and Corinne's direct phone number for urgent follow-up.

Why: Tenants pay $1,350/month. They expect professional management, not casual text messages. Early comms agent output included "Hey Marcus! Got your request -- we're on it!" Marcus was a 62-year-old retired teacher who found the tone disrespectful. He called Mark directly to complain. Mark agreed.

Failure mode: Casual tone alienates older or more formal tenants. A 62-year-old tenant paying $16,200/year in rent does not want to receive a text that reads like it came from a college intern. Tone mismatch erodes trust. Tenant does not renew. Lost lifetime value over 4 years: $64,800.

C013 MEDIUM MEASURED RESULT 6x efficiency

The vendor agent tracks response times for all dispatched work orders. Vendors who exceed their committed response time by more than 4 hours on 3 or more occasions are flagged for review. The flag goes to Corinne, not the vendor. Vendor relationship management is human-only.

Why: Our preferred HVAC vendor was consistently 6-8 hours late on non-emergency calls. The agent flagged the pattern after 5 late responses. Corinne renegotiated the response time SLA and secured a $50 discount per late response. Without the tracking, the pattern would have gone unnoticed because each individual delay seemed reasonable.

Failure mode: Without vendor performance tracking, patterns hide in individual incidents. Each late response seems like a one-off. Over a year, the same vendor is late 15 times. Tenants associate slow repairs with poor management. Satisfaction drops. Renewal rates drop. Mark never knows the root cause is one vendor.

Learnwell silver
C012 HIGH OBSERVED REPEATEDLY 7x efficiency

Support response time target: 4 hours during semester, 24 hours during breaks. The triage agent auto-escalates any ticket older than 3 hours during semester to the #support-urgent Slack channel.

Why: Students study on deadlines. A support ticket filed at 10 PM before an exam needs a response before midnight, not the next morning.

Failure mode: A student filed a ticket at 11 PM about not being able to access a study guide. The exam was at 8 AM. Support responded at 9 AM. The student had already failed to study the material. She left a 1-star app store review: "Platform broke the night before my exam and nobody helped." The review stayed up for 6 months and was cited by 2 prospective teachers who decided not to adopt.

C013 HIGH OBSERVED REPEATEDLY 7x efficiency

Content QA prioritizes study guides that align with upcoming exam dates. Guides for exams within 2 weeks get priority review.

Why: A factual error in a guide nobody's using is low risk. A factual error in a guide that 500 students will use for tomorrow's exam is catastrophic.

Failure mode: Before priority-based QA, all content was reviewed in creation order. A new chemistry guide (exam in 3 days, 280 students) sat behind 15 older guides in the QA queue. It contained an incorrect molecular weight. Caught 6 hours before the exam by a student who filed a support ticket.

C014 MEDIUM OBSERVED ONCE 3x efficiency

Stripe billing alerts (failed payments, subscription cancellations) are routed to Priya within 1 hour. The support agent drafts a personal "we miss you" email for cancellations, but only sends if Priya approves.

Why: Most cancellations are recoverable within 48 hours. After 48 hours, the student has found an alternative and the recovery rate drops from 35% to 8%.

Failure mode: 12 students cancelled during a billing system migration. The cancellation alerts were batched and delivered 3 days later. By then, 10 of 12 had switched to Quizlet. Recovery emails were ignored. $1,440/year in lost revenue.

McFadyen Digital Founding silver
C013 HIGH HUMAN DEFINED RULE 5x efficiency

When the Delivery Monitor flags a project as "at risk" (velocity drop >20% for 2 consecutive sprints), the PM must acknowledge within 24 hours with a remediation plan. If no acknowledgment in 24 hours, it escalates to the SVP of Global Delivery automatically.

Why: Silent project degradation was our biggest delivery risk. PMs naturally want to "fix it internally" before escalating. The forced acknowledgment window prevents hiding.

Failure mode: A PM acknowledged the alert but submitted a boilerplate remediation plan ("will add resources next sprint") without actually investigating the root cause. The project continued to degrade. We now require the PM to cite specific Jira tickets and the root cause in the acknowledgment.
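
One plausible reading of the trigger (each of the last two sprints down more than 20% from its predecessor), sketched below; the claim doesn't specify the baseline, so treat this as an assumption:

```python
VELOCITY_DROP = 0.20  # >20% drop, two sprints running, flags the project at risk

def at_risk(velocities: list[float]) -> bool:
    """velocities: sprint velocities in order, newest last; needs 3+ sprints."""
    if len(velocities) < 3 or any(v <= 0 for v in velocities):
        return False
    last_two_drops = [
        1 - velocities[i] / velocities[i - 1]
        for i in (len(velocities) - 2, len(velocities) - 1)
    ]
    return all(drop > VELOCITY_DROP for drop in last_two_drops)
```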

C014 MEDIUM HUMAN DEFINED RULE 3x efficiency

AI code review comments that go unaddressed for 48 hours are auto-escalated to the tech lead. This prevents PR queues from stalling because developers dismiss AI feedback.

Why: Early adoption showed developers ignoring AI comments at a 40% rate, assuming they were false positives. Some were. Many were not. The escalation creates accountability.

Failure mode: The escalation initially went to the PM instead of the tech lead. PMs did not have the technical context to evaluate whether the AI comment was valid. We rerouted to tech leads who can make the call in 5 minutes.

C015 MEDIUM HUMAN DEFINED RULE 3x efficiency

The Knowledge Navigator's staleness scoring triggers automatic review requests to documentation owners when a page has not been updated in 6 months and has been cited more than 10 times.

Why: High-citation, stale documentation is the most dangerous kind -- it is trusted precisely because it is frequently referenced, but the information may be outdated.

Failure mode: Documentation owners were overwhelmed with review requests during the initial rollout (we had 3+ years of Confluence debt). We added a priority queue based on citation frequency x staleness to focus on the most dangerous pages first.
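
The priority queue described above (citation frequency x staleness), sketched with assumed page fields:

```python
from datetime import datetime, timezone

STALE_AFTER_DAYS = 180  # 6 months
MIN_CITATIONS = 10

def review_queue(pages: list[dict]) -> list[dict]:
    """pages carry 'last_updated' (aware datetime) and 'citations' (int).
    Score = citations x days stale, so trusted-but-old pages surface first."""
    now = datetime.now(timezone.utc)
    scored = []
    for page in pages:
        days_stale = (now - page["last_updated"]).days
        if days_stale >= STALE_AFTER_DAYS and page["citations"] > MIN_CITATIONS:
            scored.append((page["citations"] * days_stale, page))
    return [page for _, page in sorted(scored, key=lambda s: s[0], reverse=True)]
```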

C007 HIGH OBSERVED REPEATEDLY 7x efficiency

New agents run in shadow mode for 14 days before their output reaches anyone outside the team.

Why: Our auto-send experiment failed on day 3 of deploying the client update agent: it sent a "your CPL increased 34% this week" email to 12 clients at 6:47 AM on a Saturday. Three clients called within the hour. The CPL increase was real but within normal weekly variance, and the email lacked context about seasonality. We turned off auto-send and haven't re-enabled it.

Failure mode: Agents send raw metrics without context to clients. Clients panic. Founder spends Saturday morning on damage control calls.

C008 HIGH OBSERVED REPEATEDLY 7x efficiency

Stale data flags must include hours-since-update, not just "stale."

Why: "Stale" means nothing. Is it 2 hours old or 2 days old? The briefing said "Dash data is stale" for our ad monitoring output. The founder assumed it was a few hours old and made decisions accordingly. It was actually 3 days old because the Meta API token had expired Friday evening and nobody noticed until Tuesday morning.

Failure mode: Ambiguous staleness labels lead to decisions based on data that's far older than assumed.

C013 MEDIUM MEASURED RESULT 6x efficiency

Weekly agent review must include a false positive rate for each alerting agent. Target: below 15%.

Why: After the 47-alerts-in-one-day incident (see C002), we started tracking false positive rates. Our Google Ads monitor was at 62% false positives -- nearly two-thirds of its alerts required no action. We recalibrated thresholds and got it to 11% over the next two weeks. The Meta monitor was already at 8%. Without tracking the rate, we wouldn't have known which agent needed calibration.

Failure mode: Without measurement, alert quality degrades silently. Teams compensate by ignoring alerts rather than fixing thresholds.
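
A sketch of the weekly tally, assuming each alert is marked actionable or not during human review:

```python
from collections import defaultdict

FP_TARGET = 0.15  # weekly false positive rate target per alerting agent

def weekly_fp_report(alerts: list[dict]) -> dict:
    """alerts carry 'agent' and 'actionable' (bool, set during review)."""
    counts = defaultdict(lambda: [0, 0])  # agent -> [false_positives, total]
    for alert in alerts:
        counts[alert["agent"]][0] += int(not alert["actionable"])
        counts[alert["agent"]][1] += 1
    return {
        agent: {"fp_rate": fp / total, "recalibrate": fp / total > FP_TARGET}
        for agent, (fp, total) in counts.items()
    }
```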

C015 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Feasibility spikes at Step 4 resolve technology uncertainty before committing. AI generates a small, throwaway proof-of-concept for each uncertain technology choice. If the spike fails, the technology is eliminated.

Why: Technology selection based on documentation and AI recommendation alone is unreliable. A 2-hour spike reveals integration problems that no amount of research can predict.

Failure mode: Team selects a database technology based on AI analysis of documentation. The technology has an undocumented limitation that blocks a core use case. Discovered at Step 10. Migration required.

C016 MEDIUM INFERENCE 2x efficiency

Implementation proceeds one stakeholder slice at a time (Step 10). Each slice is reviewed by the human practitioner before the next slice begins.

Why: Large-batch implementation accumulates errors that compound. Per-slice review catches integration issues, requirement misunderstandings, and scope drift before they propagate.

Failure mode: Team implements three stakeholder slices without intermediate review. The first slice has a data model error. Slices two and three build on the error. All three require rework.

C017 HIGH HUMAN DEFINED RULE 5x efficiency

The scientific method applies to business purpose validation. Hypotheses are stated, predictions are made, tests are designed, and results are evaluated against pass/fail criteria. AI generates the test infrastructure. The human defines the hypotheses and evaluates the results.

Why: Without explicit pass/fail criteria, validation becomes subjective. "It seems to work" is not validation. "These three metrics exceeded these three thresholds" is validation.

Failure mode: Team "validates" by showing the prototype to stakeholders and asking "Does this look right?" Stakeholders approve politely. The system fails in production because polite approval is not the same as validated business purpose.

C011 MEDIUM OBSERVED REPEATEDLY 4x efficiency

The banned phrases list must be reviewed monthly. Add phrases that appear in client feedback as "generic," "consultant-speak," or "AI-sounding." Remove phrases that have been successfully avoided for 3+ months (they're internalized).

Why: Language drift is continuous. New cliches emerge. Old ones fade. A static banned list becomes irrelevant over time. The list is a living document that reflects current failure patterns.

Failure mode: The banned list went 4 months without update. During that period, Forge started using "unlock value" and "drive impact" heavily -- phrases not on the original list. A client's feedback form noted "the deliverable felt AI-generated." The founder added 8 new phrases to the banned list.

C012 MEDIUM OBSERVED ONCE 3x efficiency

For new client engagements, Scout must produce a "day zero" research brief within 4 hours of the signed SOW. This brief establishes the baseline: industry context, competitive landscape, key players, and known risks. Forge and Prep both read this brief before producing their first outputs.

Why: The first 48 hours of a new engagement set the tone. If the founder walks into the kickoff meeting without solid research, the client questions whether they made the right choice. The day-zero brief ensures every agent starts with shared context.

Failure mode: New engagement kicked off without a day-zero brief. Prep created meeting talking points based on the SOW alone (no industry context). The founder asked a question in the kickoff that revealed unfamiliarity with a major regulatory change in the client's industry. The client's General Counsel raised an eyebrow. It took 3 meetings to rebuild confidence.

C013 MEDIUM OBSERVED ONCE 3x efficiency

If the founder has fewer than 3 OTP hours in a week, defer all non-build work.

Why: Low-availability weeks must protect build time above everything else.

Failure mode: A low-availability week spent on outreach delayed the timeline by 2 weeks.

C010 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Proposal pricing is calculated using a blended rate model: Mara's time at $175/hr, senior designer time at $150/hr, junior designer time at $90/hr, with a 15% agency margin. The proposal agent calculates project pricing using this model and presents Mara with a price range (low/expected/high based on scope uncertainty).

Why: Mara was chronically underpricing projects because she estimated from memory. The agent's pricing model ensures every proposal covers costs and maintains margin.

Failure mode: Before the pricing model, Mara quoted a $12K brand identity project based on "feel." Actual cost (tracked post-project): $15,800 in labor. The $3,800 loss on a small agency's margins was felt for 2 months.
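
The blended model, sketched with an assumed ±20% scope-uncertainty band for the low/high ends (the actual band isn't stated):

```python
RATES = {"principal": 175, "senior_designer": 150, "junior_designer": 90}  # $/hr
AGENCY_MARGIN = 0.15

def price_range(hours: dict, uncertainty: float = 0.20) -> tuple:
    """hours maps role -> estimated hours. Returns (low, expected, high) in dollars."""
    labor = sum(RATES[role] * h for role, h in hours.items())
    expected = labor * (1 + AGENCY_MARGIN)
    return (expected * (1 - uncertainty), expected, expected * (1 + uncertainty))

# usage: price_range({"principal": 10, "senior_designer": 40, "junior_designer": 60})
```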

C011 MEDIUM OBSERVED ONCE 3x efficiency

Client feedback synthesis groups feedback into three categories: (1) Factual corrections ("The phone number is wrong"), (2) Preference statements ("I prefer the blue version"), (3) Strategic concerns ("This doesn't feel premium enough for our audience"). The creative team receives all three but is expected to address #1 immediately, consider #2, and discuss #3 with Mara before acting.

Why: Not all client feedback carries the same weight. Factual corrections are objective. Preferences are subjective. Strategic concerns may require a creative rationale, not a revision.

Failure mode: Without categorization, the creative team treated all feedback equally. A client's casual "I kind of like the blue better" (preference) was treated the same as "This doesn't match our brand positioning" (strategic concern). The team changed the color without discussing the strategic point. The client was happy with the color but dissatisfied with the positioning. Two additional revision rounds followed.

C012 MEDIUM OBSERVED REPEATEDLY 4x efficiency

The competitive visual analysis agent uses GPT-4V to analyze visual trends in competitor work. Analysis covers layout patterns, typography trends, color usage, and design language. The agent produces structured reports, not creative direction.

Why: Visual analysis requires a model capable of interpreting images. Claude handles text; GPT-4V handles visual interpretation. The two platforms serve complementary roles.

Failure mode: An early attempt to describe competitor visuals using text-only (Claude) produced vague descriptions ("clean, modern aesthetic with blue tones"). GPT-4V analysis was specific: "Competitor uses a 12-column grid with 60/30/10 color ratio, Helvetica Neue at 3 type scales, 24px base unit." The specificity made the analysis actually useful to the design team.

R3V Founding gold
C006 HIGH OBSERVED REPEATEDLY 7x efficiency

Memory should be treated as a first-class operating asset, with separate components for event logging, consolidation, and retrieval.

Why: The system includes Scribe for event logging, Archivist for summary consolidation, Seeder for bootstrap summary creation, and CustomerOps memory tools for event logs, summaries, refreshes, and rebuilds.

Failure mode: Without staged memory management, later agents reprocess too much raw data, lose continuity across interactions, and make decisions on stale or fragmented context.

C007 MEDIUM INFERENCE 2x efficiency

When a contact already has usable memory, the org prefers lightweight contextual refresh over full recomputation.

Why: Lens explicitly uses different behavior on memory hit vs. memory miss, and the broader memory architecture supports incremental consolidation rather than always rebuilding from scratch.

Failure mode: Always recomputing full context increases token cost, slows response time, and creates more opportunities for inconsistency between runs.

C015 HIGH OBSERVED REPEATEDLY 7x efficiency

The org prefers narrow, structured outputs over open-ended prose for machine-to-machine handoffs.

Why: Agent descriptions repeatedly reference structured summaries, schema versions, typed fields, validator outputs, and table/output writers rather than free-text-only communication.

Failure mode: Unstructured handoffs increase ambiguity between steps, raise parsing risk, and make validators and downstream tools less effective.

C007 HIGH MEASURED RESULT 10x efficiency

For local service businesses, the agent must check search terms for geographic intent mismatches weekly, not just cost and conversion metrics.

Why: An HVAC client's campaigns looked great on paper: CPL $31, 38 leads/month. But 12 of those leads searched for "AC repair [neighboring city]" and were served ads because the radius targeting overlapped into the next town. The HVAC company doesn't service that area. They were paying $31 per useless lead for a third of their volume. The geographic search term check caught what the performance metrics missed.

Failure mode: Performance metrics look healthy while geographic targeting silently wastes budget. Local businesses serve defined areas that don't always align with radius targeting.

C008 HIGH MEASURED RESULT 10x efficiency

Client reports must include a "leads by city" breakdown for any local service business.

Why: After the plumber incident (C001) and the HVAC issue (C007), we added city-level lead breakdowns to every report. Two more geographic problems were caught in the first month: a family law firm getting leads from a state where they're not licensed, and a dentist attracting patients from 45 minutes away who never convert because the drive is too far.

Failure mode: Aggregate lead counts mask geographic distribution problems. Clients don't know their ad dollars are leaking into wrong territories until they see the breakdown.

C013 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Conversion tracking status checks run daily and flag any account where conversion actions have recorded zero conversions in 48+ hours (for accounts that typically convert daily).

Why: A dentist client's Google Tag stopped firing after a WordPress plugin update. The agent saw "zero leads today" and reported it as low performance. It took 4 days for the media buyer to realize it was a tracking issue, not a performance issue. During those 4 days, the actual leads were coming in (the phone was ringing) but nothing was being attributed. Bidding algorithms degraded because they thought nothing was converting.

Failure mode: Tracking failure is misdiagnosed as performance decline. Bidding algorithms lose signal. Agents report "bad performance" when the actual problem is measurement.
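
The daily check reduces to a filter over accounts that normally convert every day; field names are illustrative:

```python
from datetime import datetime, timedelta, timezone

ZERO_CONVERSION_WINDOW = timedelta(hours=48)

def suspect_tracking_failures(accounts: list[dict]) -> list[str]:
    """accounts carry 'name', 'converts_daily' (bool), 'last_conversion' (datetime).
    Restricting to daily converters keeps low-volume accounts from false-alarming."""
    now = datetime.now(timezone.utc)
    return [
        acct["name"] for acct in accounts
        if acct["converts_daily"] and now - acct["last_conversion"] > ZERO_CONVERSION_WINDOW
    ]
```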

Sneeze It Founding gold
C011 HIGH OBSERVED ONCE 5x efficiency

Reports generated before 7 AM use yesterday's final numbers, not partial today numbers. Never mix time windows in a single report.

Why: Partial-day data creates misleading trends. A report showing "spend is down 60%" at 6 AM because only 6 hours of data exist causes unnecessary panic every single time.

Failure mode: Client receives an early-morning report showing spend down 60%. Calls the account manager in alarm. The AM spends 30 minutes explaining that it is just early-morning partial data. This happened three times before we fixed the rule.

C012 MEDIUM MEASURED RESULT 6x efficiency

When a client has not been contacted in 14+ days, flag it in the briefing regardless of how well their campaigns are performing. Silence is a churn signal even when the numbers are good.

Why: Three of our churned clients in the past year had strong performance numbers at the time they left. They did not leave because of results. They left because they felt ignored and undervalued.

Failure mode: Client campaigns perform well for 6 straight weeks. No proactive outreach from the team. Client quietly signs with a competitor who calls them every week.

C013 MEDIUM OBSERVED ONCE 3x efficiency

New agents start in shadow mode for 2 weeks minimum. They generate output that a human reviews but the team does not act on. After 2 weeks of consistently accurate output, they graduate to draft mode where output is used after human review.

Why: We deployed the prospecting agent directly into production without a shadow period. Its first batch of outreach emails included a company that was a current client's direct competitor. Two weeks of shadow mode would have caught that conflict on day 4.

Failure mode: New agent sends outreach to a prospect that has a direct conflict with an existing client relationship. Client hears about it through industry contacts. Trust damaged.

C015 HIGH OBSERVED REPEATEDLY 7x efficiency

If data is stale, flag it visibly. Never silently present old information as current.

Why: Stale data presented as current causes wrong decisions. Visible staleness lets the consumer decide how to weight the information.

Failure mode: Briefing shows yesterday's ad spend as today's. Founder makes budget decisions on wrong numbers.

C016 MEDIUM OBSERVED REPEATEDLY 4x efficiency

If 3+ tasks from one person are overdue, flag as capacity pattern, not motivation problem.

Why: Individual overdue tasks might be forgotten. A pattern of overdue tasks indicates workload exceeds capacity.

Failure mode: Manager assumes delegation is lazy. Actual problem is team member is overwhelmed. Problem worsens.

L001 MEDIUM OBSERVED ONCE 3x efficiency

Any agent doing client identification or contact audits must reconcile two sources of truth: pepper-clients.md (curated, human-approved) and Accelo list_companies with status=active (system of record). If an Accelo-active domain is not in pepper-clients.md, do NOT auto-add. Flag it to David as a PROMOTION CANDIDATE with context (company name, Accelo ID, why it's missing). David makes the call because some Accelo companies have domains intentionally excluded (e.g. goldsgym.com is a multi-franchise brand where only a few locations are Sneeze clients; qualitylearning.net is Cellebration Wellness's parent domain but unknown individual contacts still need human confirmation). The inverse check also matters: any domain in pepper-clients.md with no active Accelo company is a candidate for removal.

Why: pepper-clients.md is downstream of many agent decisions: Pepper buckets emails as CLIENT vs NOISE, Dirk suppresses cold outreach to existing customers, Dash scopes active-client analysis, Radar flags client comms. When it drifts, every downstream agent silently makes wrong calls - no error, just degraded signal. Specific risk is cold-emailing someone Sneeze is billing, which damages trust. Fix is cheap (30-second reconciliation during any client audit) and prevents a class of errors that is otherwise invisible until a client or teammate points it out.

Failure mode: Client-list audits silently misclassified active Sneeze It clients as cold prospects because pepper-clients.md (the canonical client-domain allowlist that Pepper, Dirk, Dash, and Radar all read) drifted out of sync with Accelo (the authoritative active-company record). In a 2026-04-23 audit of 1,790 contacts, 3 domains were active in Accelo but missing from pepper-clients.md: almarose.com (Alma Rose), delawaredigitalmedia.com (Delaware Digital Media white-label), studstillfirm.com (Studstill Firm). Contacts on those domains were being treated as cold prospects, which risks Dirk sending a cold email to someone Sneeze is actively billing.
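
The 30-second reconciliation is a two-way set difference; note that neither direction auto-syncs -- both outputs are flags for human review:

```python
def reconcile(allowlist: set[str], accelo_active: set[str]) -> dict:
    """allowlist: domains from pepper-clients.md (curated, human-approved).
    accelo_active: domains of Accelo companies with status=active (system of record)."""
    return {
        # In Accelo but not the allowlist: flag as PROMOTION CANDIDATE, never auto-add
        "promotion_candidates": sorted(accelo_active - allowlist),
        # In the allowlist with no active Accelo company: candidate for removal
        "removal_candidates": sorted(allowlist - accelo_active),
    }
```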

Stackwise silver
C011 MEDIUM INFERENCE 2x efficiency

Support responses for annual plan customers auto-flagged YELLOW. Annual customers represent 4x monthly revenue.

Why: Sloppy response to monthly customer costs $150/year if they churn. Sloppy response to annual customer costs $1,800. Review depth should match revenue at risk.

Failure mode: Annual customer billing question gets generic response. Feels undervalued. Does not renew. $1,800 lost.

C012 HIGH MEASURED RESULT 10x efficiency

The engineering alerting agent suppresses repeat pages for the same issue within 30 minutes. The first alert pages. Subsequent alerts update the existing incident thread.

Why: A database slow query triggered 14 pages in 8 minutes. On-call engineer overwhelmed. Missed the actual resolution signal buried in noise.

Failure mode: Same issue generates 14 pages. Each interrupts the engineer. Noise drowns signal. Resolution delayed 20 minutes.
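
A minimal in-memory version of the suppression window; a real pager integration would persist state, but the policy is the point:

```python
import time

SUPPRESSION_WINDOW_S = 30 * 60

class PageDeduper:
    """First alert for an issue pages; repeats within 30 minutes update the thread."""

    def __init__(self):
        self._first_page_at = {}  # issue_key -> timestamp of the page that fired

    def handle(self, issue_key, page, update_thread):
        now = time.time()
        first = self._first_page_at.get(issue_key)
        if first is None or now - first > SUPPRESSION_WINDOW_S:
            self._first_page_at[issue_key] = now
            page(issue_key)           # interrupt the on-call engineer once
        else:
            update_thread(issue_key)  # append to the existing incident thread
```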

C010 HIGH MEASURED RESULT 10x efficiency

Patient education handouts are written at a 6th-grade reading level using the Flesch-Kincaid scale. The education agent checks readability before submitting for review.

Why: Our patient population includes a significant number of non-native English speakers and older adults. Early handouts scored at 10th-grade reading level. Dr. Okafor observed patients nodding along but clearly not understanding the content. Post-visit comprehension checks confirmed the gap.

Failure mode: Handout on managing hypertension uses terms like "antihypertensive regimen" and "sodium restriction protocol." Patient takes the handout home, does not understand it, and does not follow the guidance. Blood pressure remains uncontrolled at the next visit.
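
The check is a direct application of the Flesch-Kincaid grade-level formula; the syllable counter below is a rough vowel-group heuristic (production checkers use pronunciation dictionaries):

```python
import re

def _syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, minimum one per word
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59"""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59

def passes_readability(text: str, max_grade: float = 6.0) -> bool:
    return fk_grade(text) <= max_grade
```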

C011 MEDIUM MEASURED RESULT 6x efficiency

Appointment reminders for patients who no-showed their last visit include a warmer, non-judgmental tone and an explicit offer to reschedule. No mention of the missed appointment.

Why: The default reminder tone felt transactional. Patients who had already missed once responded better to "We'd love to see you" than "You have an appointment on Tuesday." Reschedule rate for prior no-shows improved from 31% to 48% after the tone change.

Failure mode: Standard reminder sent to a patient who missed their last appointment. Patient feels guilty or defensive. Ignores the reminder. No-shows again. Pattern solidifies.

C012 MEDIUM OBSERVED ONCE 3x efficiency

Onboarding documents are versioned with a date stamp in the filename. When clinical protocols change, the onboarding agent regenerates affected documents within 48 hours. Old versions are archived, never deleted.

Why: A new MA was trained using a document that referenced the old blood draw protocol (tourniquet for 60 seconds). Protocol had changed to 30 seconds two months prior. Document had not been updated. MA followed the outdated procedure for a full week before a supervising nurse caught it.

Failure mode: Outdated onboarding document trains new staff on a deprecated procedure. Staff performs the procedure incorrectly. In a primary care setting, most deprecated procedures are low-risk, but the cumulative effect of outdated training erodes clinical quality.
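A sketch of the stamp-and-archive step, with illustrative paths; the 48-hour regeneration trigger would live in whatever process watches for protocol changes.

```python
from datetime import date
from pathlib import Path

DOCS = Path("onboarding")
ARCHIVE = DOCS / "archive"

def publish(base_name: str, body: str) -> Path:
    """Write a date-stamped onboarding doc; archive prior versions, never delete."""
    ARCHIVE.mkdir(parents=True, exist_ok=True)
    for old in DOCS.glob(f"{base_name}-*.md"):  # prior date-stamped versions
        old.rename(ARCHIVE / old.name)          # archived, not deleted
    stamped = DOCS / f"{base_name}-{date.today():%Y-%m-%d}.md"
    stamped.write_text(body)
    return stamped
```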

C013 HIGH OBSERVED ONCE 5x efficiency

Incident severity is determined by customer impact radius, not system impact. A database hiccup affecting 1 internal dashboard is P3. A 200ms latency increase affecting all customer eval jobs is P1.

Why: Internal systems failing is inconvenient. Customer-facing systems degrading is revenue-threatening.

Failure mode: Monitoring agent classified a latency spike as P3 because the internal system health dashboard showed green (it only measured error rates, not latency). 85 customers experienced 3x slower eval results for 2 hours. 15 filed support tickets. 3 enterprise customers included the incident in their quarterly vendor review. Two of those reviews resulted in "conditional renewal" status.
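A sketch of the classification rule. The source fixes the endpoints (internal-only is P3, all-customer degradation is P1); the 50% cut between P1 and P2 is an assumed threshold, and the input must include latency, not just error rates, or the dashboard-green failure above repeats.

```python
def classify_severity(customers_affected: int, total_customers: int) -> str:
    """Severity by customer impact radius, not system impact."""
    if customers_affected == 0:
        return "P3"  # internal-only breakage: inconvenient, not revenue-threatening
    fraction = customers_affected / total_customers
    return "P1" if fraction >= 0.5 else "P2"  # 0.5 is an assumed cut, not from the source
```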

C014 MEDIUM OBSERVED ONCE 3x efficiency

Investor updates are published monthly, on the 5th, regardless of whether the numbers are good. Skipping a month signals that something is wrong.

Why: Investors pattern-match on communication cadence. A missed update generates more anxiety than a bad update.

Failure mode: Lina skipped the March investor update because MRR had dipped 4% (3 customers delayed renewals). Two board members texted within a week asking "everything okay?" The April update included the March data and the dip explanation, but the trust damage from the missed communication took the entire board meeting to repair.

C015 MEDIUM OBSERVED REPEATEDLY 4x efficiency

Sales demo prep agent refreshes demo environments weekly. Stale demo data that references outdated features or deprecated APIs undermines credibility during live demos.

Why: Enterprise prospects evaluate attention to detail. A demo that shows a deprecated feature signals that the product moves faster than the company can manage.

Failure mode: A demo environment showed an eval metric type that had been deprecated 2 months earlier. The prospect asked about it. The sales engineer said "oh, that's been removed." The prospect replied: "So your demo doesn't reflect your actual product? What else is out of date?" The deal took an additional 3 weeks to close and included a requirement for a "current state" audit before signing.

C012 HIGH OBSERVED REPEATEDLY 7x efficiency

Haven's first response to any customer inquiry must be sent within 15 minutes during business hours (9 AM - 6 PM ET). The response can be a draft that the CS rep reviews, but the customer must see a reply within 15 minutes. Outside business hours, the autoresponse sets expectations for next-business-day response.

Why: Speed to first response is the single highest-correlating factor with CS satisfaction scores. A fast "we're looking into this" beats a slow comprehensive answer every time. Haven's drafts are fast. Human review can happen after the first touch.

Failure mode: Slow first touch. Before Haven, average first response time was 4.2 hours and customer satisfaction (CSAT) was 3.4/5. After implementing the 15-minute target with Haven drafts, first response dropped to an 8-minute average and CSAT rose to 4.1/5 within 6 weeks. No other change was made during that period.
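A sketch of the routing decision, assuming inquiry timestamps arrive timezone-aware; the hours come from the claim, and the function and return labels are illustrative.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

ET = ZoneInfo("America/New_York")
OPEN, CLOSE = time(9, 0), time(18, 0)  # 9 AM - 6 PM ET business hours

def first_touch(received: datetime) -> str:
    """Route a new inquiry: Haven draft within 15 minutes, or autoresponse."""
    local = received.astimezone(ET)
    in_hours = local.weekday() < 5 and OPEN <= local.time() < CLOSE
    # Outside business hours, the autoresponse sets next-business-day expectations.
    return "draft-now" if in_hours else "autoresponse"
```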

C013 MEDIUM OBSERVED ONCE 3x efficiency

Rhythm must test subject lines on a 10% sample before full send for any campaign going to more than 5,000 contacts. The winning subject line (by open rate after 2 hours) goes to the remaining 90%. No exceptions for "time-sensitive" campaigns.

Why: A 5% improvement in open rate on a 28,000-person list is 1,400 additional opens. On Threadline's average click-to-open rate of 18%, that's 252 additional clicks. At a 3.2% conversion rate, that's 8 additional orders averaging $67 each -- $536 in revenue from a 2-hour wait.

Failure mode: The marketing coordinator overrode the A/B test for a Black Friday campaign because "we need to send now, every minute counts." The chosen subject line had a 14% open rate. The founder ran the unused B variant to a test segment later: 23% open rate. Estimated lost revenue from skipping the test: $4,800 on Black Friday, the single highest-revenue email day of the year.
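A sketch of the sampling step; send is a stand-in for the actual delivery call, and winner selection after the 2-hour open-rate read happens in the caller.

```python
import random

def run_subject_test(contacts, variant_a, variant_b, send, sample_frac=0.10):
    """Send two subject-line variants to a 10% sample; return the 90% holdout."""
    if len(contacts) <= 5_000:
        raise ValueError("A/B gate applies only to sends over 5,000 contacts")
    pool = list(contacts)
    random.shuffle(pool)
    cut = int(len(pool) * sample_frac)
    sample, holdout = pool[:cut], pool[cut:]
    half = len(sample) // 2
    send(variant_a, sample[:half])   # A half of the sample
    send(variant_b, sample[half:])   # B half of the sample
    return holdout  # gets the winning subject line after the 2-hour read
```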

C010 HIGH OBSERVED REPEATEDLY 7x efficiency

Quarterly investor reports are prepared 21 days before distribution. The first 7 days are for agent drafting and internal review. The next 7 days are for Chen's compliance review and outside counsel if needed. The final 7 days are buffer for revisions.

Why: Rushing quarterly reports produces C003-type errors. The 21-day cycle ensures every number is audited, every statement is compliant, and there is time to fix problems.

Failure mode: Before the 21-day cycle, Q3 reports were prepared in 5 days. The C003 incident (preliminary vs. audited IRR discrepancy) happened because there was no time for Derek to complete the audit reconciliation before distribution.

C011 HIGH OBSERVED REPEATEDLY 7x efficiency

Deal memos include a mandatory "Risk Factors" section with a minimum of 8 risk factors. The deal memo agent generates risk factors from a master risk taxonomy and adds deal-specific risks identified during analysis.

Why: Insufficient risk disclosure in offering materials creates legal liability. If an investor loses money on a risk that was foreseeable but undisclosed, the liability falls on the fund.

Failure mode: An early deal memo had 3 risk factors, all generic ("Market conditions may change," "Past performance does not guarantee future results," "Real estate is illiquid"). Chen added 9 deal-specific risks including environmental remediation liability, tenant concentration risk, and interest rate sensitivity. After this, the minimum was set at 8 with mandatory deal-specific analysis.
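A sketch of the memo gate; the generic strings are the ones from the failure mode, and exact-string matching is an illustrative stand-in for checking entries against the master risk taxonomy.

```python
GENERIC_RISKS = {
    "Market conditions may change",
    "Past performance does not guarantee future results",
    "Real estate is illiquid",
}

def validate_risk_section(risks: list[str], minimum: int = 8) -> list[str]:
    """Reject memos below the risk-factor floor or lacking deal-specific risks."""
    problems = []
    if len(risks) < minimum:
        problems.append(f"only {len(risks)} risk factors; minimum is {minimum}")
    if all(r in GENERIC_RISKS for r in risks):
        problems.append("no deal-specific risk factors identified")
    return problems  # empty list means the section passes
```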

C012 MEDIUM OBSERVED ONCE 3x efficiency

Investor communications use tiered language precision based on the content type. Performance updates: exact numbers with 2 decimal places and data source attribution. Market context: ranges and qualifiers ("approximately," "in the range of"). Outlook: conditional language only ("if market conditions persist," "subject to").

Why: Precision signals competence. But false precision on uncertain topics signals naivete or deception. An investor who reads "We project 14.7% IRR" treats it as a promise. "Under base case assumptions, projected returns range from 12-16% IRR" is honest.

Failure mode: The investor comms agent drafted a year-end letter stating "Our portfolio returned 11.4% in 2025." Derek's audited number was 11.38%. The rounding was correct, but the letter didn't cite the audited source. Chen added "Based on audited Q4 2025 financials prepared by [Auditor Name]" to every performance figure.
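A sketch of a lint pass over draft sections; the regexes are illustrative heuristics for each tier, not a full compliance check.

```python
import re

RULES = {
    "performance": {"requires_source": True},
    "market":  {"banned": re.compile(r"\b\d+\.\d+%")},  # exact figures: use ranges instead
    "outlook": {"required": re.compile(r"\b(if|subject to|assuming|under)\b", re.I)},
}

def lint(section: str, text: str, has_source_citation: bool) -> list[str]:
    """Flag language-precision violations in a draft investor communication."""
    rule, issues = RULES[section], []
    if rule.get("requires_source") and not has_source_citation:
        issues.append("performance figure lacks data source attribution")
    if "banned" in rule and rule["banned"].search(text):
        issues.append("market context uses exact figures; use ranges and qualifiers")
    if "required" in rule and not rule["required"].search(text):
        issues.append("outlook statement missing conditional language")
    return issues
```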

Vetted Goods silver
C013 MEDIUM OBSERVED REPEATEDLY 4x efficiency

When using GPT for creative agents (Cadence, Chorus) and Claude for analytical agents (Pulse, Signal, Atlas, Ledger, Prism), maintain separate evaluation criteria. Creative agents are evaluated on brand voice consistency and engagement metrics. Analytical agents are evaluated on accuracy and signal-to-noise ratio. Never evaluate a creative agent on precision or an analytical agent on tone.

Why: The two platforms were chosen for different strengths. Evaluating both with the same rubric incentivizes the wrong behaviors -- GPT agents get over-optimized for accuracy (killing creativity) and Claude agents get prompted for engaging tone (introducing imprecision).

Failure mode: The team applied a single "quality score" rubric to all agents. Chorus (GPT, creative) scored low on "factual accuracy" because product descriptions included aspirational language. The team tried to make Chorus more precise, which killed the brand voice. Meanwhile, Pulse (Claude, analytical) scored low on "engaging presentation." The team added formatting requirements that made Pulse's alerts harder to scan quickly. Both agents got worse by being evaluated on the wrong criteria.
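A sketch of the split rubric as configuration; the agent names and metrics come from the claim, the key names are illustrative.

```python
RUBRICS = {
    "creative":   ["brand_voice_consistency", "engagement"],  # Cadence, Chorus (GPT)
    "analytical": ["accuracy", "signal_to_noise"],            # Pulse, Signal, Atlas, Ledger, Prism (Claude)
}

AGENT_TYPE = {
    "cadence": "creative", "chorus": "creative",
    "pulse": "analytical", "signal": "analytical", "atlas": "analytical",
    "ledger": "analytical", "prism": "analytical",
}

def criteria_for(agent: str) -> list[str]:
    """Return only the criteria valid for this agent's type -- never the other rubric."""
    return RUBRICS[AGENT_TYPE[agent]]
```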

C014 HIGH OBSERVED REPEATEDLY 7x efficiency

Agent context switches between brands must include a "brand flush" step: clear the prior brand's context, load the new brand's configuration file, and confirm the brand identity in the output header. No "carry-over" operations where an agent finishes Brand A work and immediately starts Brand B work without context clearing.

Why: Context carry-over is the root cause of voice bleed, data leakage, and policy confusion. The 30-second cost of a brand flush is negligible compared to the cost of any cross-brand contamination incident.

Failure mode: The adventure-tee incident (C001), the email cross-contamination (C002), and the CS tone mismatch (C006) all traced back to context carry-over. Implementing mandatory brand flush reduced cross-brand incidents from 4-6 per month to 0-1 per month within the first 30 days.
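A sketch of the flush as a three-step protocol; the session object and its methods are an assumed interface, and brands/<name>.json is an illustrative config location.

```python
import json
from pathlib import Path

def brand_flush(session, brand: str) -> None:
    """Hard reset between brands: clear, reload, confirm -- no carry-over."""
    session.clear_context()                                        # 1. drop the prior brand's state
    config = json.loads(Path(f"brands/{brand}.json").read_text())  # 2. load the new brand's config
    session.load(config)
    header = session.emit_header()                                 # 3. confirm identity in the output header
    if config["brand_id"] not in header:
        raise RuntimeError(f"brand flush failed for {brand}")
```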