Amplify Collective
core operating rules
Claude agents and GPT agents never communicate directly. All cross-model data flows through a shared JSON schema file with explicit field definitions and validation.
Why: When the ad copy generator (GPT) needed performance data to write "winning angle" variations, we initially passed Claude's analysis as a natural language summary. GPT interpreted "strong performance on pain point messaging" as license to generate copy about physical pain for a fitness client. The client sold personal training, not physical therapy. The generated headline: "Stop Living in Pain -- Book Your Session Today." The creative director caught it in review. The root cause: natural language summaries lose precision across model boundaries.
Failure mode: Natural language context transfer between models introduces interpretation drift. Each model fills gaps with its own assumptions. Downstream output diverges from source intent.
Scope: All cross-model communication.
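The rule above can be sketched as a strict schema gate on the handoff payload. This is a minimal illustration, not the agency's actual schema — the field names (`client_id`, `winning_angle`, `cpl_change_pct`) and the allowed angle values are assumptions:

```python
# Hypothetical cross-model handoff schema. Every field is explicitly
# typed, and enumerated fields have a closed value set, so the receiving
# model never "interprets" a loose natural-language summary.
ALLOWED_ANGLES = {"price", "convenience", "results", "trust"}

HANDOFF_SCHEMA = {
    "client_id": str,
    "winning_angle": str,      # must be one of ALLOWED_ANGLES
    "cpl_change_pct": float,   # signed percentage, e.g. -12.5
}

def validate_handoff(payload: dict) -> dict:
    """Reject any payload that doesn't match the schema exactly."""
    extra = set(payload) - set(HANDOFF_SCHEMA)
    missing = set(HANDOFF_SCHEMA) - set(payload)
    if extra or missing:
        raise ValueError(f"schema violation: extra={extra}, missing={missing}")
    for field, expected in HANDOFF_SCHEMA.items():
        if not isinstance(payload[field], expected):
            raise TypeError(f"{field}: expected {expected.__name__}")
    if payload["winning_angle"] not in ALLOWED_ANGLES:
        raise ValueError(f"winning_angle must be one of {ALLOWED_ANGLES}")
    return payload
```

The closed value set is the point: "strong performance on pain point messaging" cannot cross the boundary, only an enumerated angle can.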
Every piece of generated ad copy must pass through a fact-verification layer that checks claims against a client-approved fact sheet before entering any review queue.
Why: GPT generated a testimonial-style ad: "After 6 weeks with [Client], I lost 32 pounds and feel amazing!" No such testimonial existed. The client had never published weight loss claims. The ad copy looked plausible and polished. An account manager approved it without checking. The ad ran for 3 days on Meta before the client's compliance officer flagged it. The client is a medical weight management clinic regulated by state advertising rules. Fabricated testimonials are not just brand risk -- they're legal exposure.
Failure mode: Generative models produce plausible but fabricated claims. Humans approve polished output without verifying underlying facts. Regulated industries face legal risk from AI-generated content.
Scope: All creative generation agents.
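A fact-verification layer of this kind can be approximated with a claim scan before anything enters the review queue. The sketch below is an assumption about implementation, not the agency's system — the claim-extraction regex and unit list are illustrative and a production version would need broader claim detection:

```python
import re

def verify_copy(copy_text: str, approved_claims: set[str],
                prohibited_phrases: set[str]) -> list[str]:
    """Return a list of violations; an empty list means the copy
    may enter the review queue."""
    violations = []
    lowered = copy_text.lower()
    for phrase in prohibited_phrases:
        if phrase.lower() in lowered:
            violations.append(f"prohibited phrase: {phrase!r}")
    # Numeric claims (e.g. "32 pounds", "3x ROAS") must appear verbatim
    # inside some client-approved claim.
    for num_claim in re.findall(r"\d+(?:\.\d+)?\s*(?:%|x|pounds|lbs)", lowered):
        if not any(num_claim in c.lower() for c in approved_claims):
            violations.append(f"unapproved numeric claim: {num_claim!r}")
    return violations
```

The fabricated "lost 32 pounds" testimonial would be caught here because the number never appears in any approved claim.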
Client fact sheets are updated quarterly and include: approved claims, prohibited claims, regulatory restrictions, competitive positioning statements, and testimonial sources with dates.
Why: The fabricated testimonial incident (C002) revealed we had no single source of truth for what a client's ads could and couldn't say. Account managers carried this knowledge in their heads. When agents generate creative, they need the same guardrails in structured, machine-readable form. After building fact sheets for all 50+ clients, the fact-check layer caught 14 additional problematic claims in the first month -- including a "guaranteed results" claim for a client whose contract explicitly prohibits guarantee language.
Failure mode: Without structured fact sheets, creative agents generate from general knowledge instead of client-specific constraints. Prohibited claims surface as plausible copy.
Scope: All creative generation agents.
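The five fact-sheet components listed above map naturally onto a machine-readable structure. A minimal sketch, with field names chosen to mirror the list (the actual storage format is not specified in this document):

```python
from dataclasses import dataclass, field

@dataclass
class ClientFactSheet:
    """Quarterly-updated guardrails in machine-readable form,
    consumed by creative generation agents."""
    client_id: str
    approved_claims: list[str] = field(default_factory=list)
    prohibited_claims: list[str] = field(default_factory=list)
    regulatory_restrictions: list[str] = field(default_factory=list)
    positioning_statements: list[str] = field(default_factory=list)
    # testimonial text -> publication date (ISO), so age can be checked
    testimonial_sources: dict[str, str] = field(default_factory=dict)
```

Keeping testimonials keyed by date supports the "with dates" requirement: a fact-check layer can reject any testimonial not present in this mapping.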
agent roles and authority
Claude's analysis agents own the "what happened" and "what the data says." GPT's creative agents own "how to say it." Neither encroaches on the other's domain.
Why: We initially gave GPT's ad copy agent access to raw Meta Ads data so it could "write data-informed copy." It started including performance claims in ad text: "Our campaigns deliver 3x ROAS" -- pulling the number from another client's data. The model conflated data access with permission to cite. Removing raw data access and providing only structured briefs from Claude's analysis agents eliminated this class of error.
Failure mode: Creative agents with data access cite performance numbers that may be from wrong accounts, outdated, or confidential. Data access is conflated with citation authority.
Scope: All creative agents.
The client health scoring agent flags risk. The account manager owns the relationship response. The agent never suggests what to say to the client.
Why: The health scoring agent flagged a fitness franchise client as "high churn risk: CPL up 28%, no call scheduled in 3 weeks, last email unanswered." It also suggested: "Consider offering a rate reduction to retain." The account manager forwarded the agent's recommendation verbatim in an internal thread. The founder saw it and asked why we'd offer a discount to a client who hadn't complained. The client wasn't actually unhappy -- they were at a fitness industry conference and had told the AM they'd be offline for 2 weeks. The agent didn't know about the conference. The AM did but deferred to the agent's scoring.
Failure mode: Agents lack offline context (conversations, events, relationships). Recommendations based on data alone miss human context. Team members defer to agent recommendations instead of applying their own knowledge.
Scope: All health scoring and recommendation agents.
coordination patterns
The briefing agent compiles from shared state files in a fixed order: spend pacing first, then alerts, then pipeline, then creative status. Order is consistent daily.
Why: Random ordering caused the founder to miss urgent items buried in the middle of the briefing. When spend pacing alerts always come first, the founder scans the top section in 30 seconds and knows if any account needs immediate action. Creative status at the bottom is read during a different workflow (creative review meeting). Fixed order matches the founder's decision-making cadence.
Failure mode: Variable briefing structure forces the reader to hunt for urgent items. Information architecture should match the reader's priority hierarchy.
Scope: Briefing and reporting agents.
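The fixed ordering amounts to compiling sections from shared state against a constant key list, independent of write order. A small sketch — the section keys are taken from the rule above, the state format is an assumption:

```python
SECTION_ORDER = ["spend_pacing", "alerts", "pipeline", "creative_status"]

def compile_briefing(state: dict[str, str]) -> str:
    """Assemble the daily briefing in the same order every day,
    regardless of the order sections were written to shared state."""
    parts = []
    for section in SECTION_ORDER:
        body = state.get(section, "(no update)")
        parts.append(f"## {section}\n{body}")
    return "\n\n".join(parts)
```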
When a Claude analysis agent and a GPT creative agent work on the same client simultaneously, the analysis must complete and write to shared state before the creative agent reads it. No parallel execution on the same client.
Why: The ad copy generator read mid-write shared state data for a healthcare client. The analysis was half-complete -- it had processed 3 of 7 campaigns. The copy generator used the partial data to produce "new angle" copy that leaned into the wrong service line. The 3 campaigns processed first were all for Botox. The remaining 4 were weight management, which was the client's priority. The generated copy was 100% Botox-focused.
Failure mode: Parallel execution across model boundaries creates race conditions on shared state. Partial data produces skewed output.
Scope: All agents sharing state within a client context.
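One way to enforce "analysis completes before creative reads" is a write-then-rename pattern with an explicit completion status, so a reader can never observe the 3-of-7-campaigns state. This is a sketch of the pattern, not the agency's actual state store:

```python
import json
import pathlib

def write_analysis(path: pathlib.Path, data: dict) -> None:
    """Write to a temp file, then rename into place. The rename is
    atomic on POSIX, so readers see either nothing or a complete file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"status": "complete", **data}))
    tmp.replace(path)

def read_analysis(path: pathlib.Path) -> dict:
    """Refuse to hand partial state to a downstream agent."""
    if not path.exists():
        raise RuntimeError("analysis not started for this client")
    data = json.loads(path.read_text())
    if data.get("status") != "complete":
        raise RuntimeError("analysis incomplete; refusing partial data")
    return data
```

The creative agent calls `read_analysis` and gets a hard error instead of a half-finished, Botox-skewed dataset.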
operational heuristics
GPT creative output undergoes a mandatory 2-hour hold before entering the review queue. No same-session generation and approval.
Why: Account managers reviewing GPT output immediately after generation approved it at a 94% rate. When we added a 2-hour hold (the AM reviews the batch later in the day or the next morning), the approval rate dropped to 71%. The 23-point gap was copy that "sounded good in the moment" but had issues visible with fresh eyes -- subtle tone mismatches, claims that were technically true but misleading, and formatting that didn't match the client's brand voice.
Failure mode: Same-session review creates familiarity bias. The reviewer just saw the brief, the context is loaded, and the output feels like a natural continuation. Distance improves judgment.
Scope: All creative review workflows.
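The hold is a simple timestamp gate between generation and the review queue. A minimal sketch (queue mechanics are assumed, only the gate is shown):

```python
from datetime import datetime, timedelta

HOLD = timedelta(hours=2)

def reviewable(generated_at: datetime, now: datetime) -> bool:
    """Creative enters the review queue only after the hold elapses."""
    return now - generated_at >= HOLD
```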
Track and report the cross-model error rate separately from single-model error rates. Any task that involves both Claude and GPT is measured as a distinct category.
Why: Our overall agent error rate was 3.2%. When we segmented, single-model tasks (Claude-only or GPT-only) had a 1.8% error rate. Cross-model tasks had an 8.7% error rate -- nearly 5x. The errors were concentrated at handoff points: incomplete context transfer, schema mismatches, and misinterpretation of structured fields. Without segmenting, the 3.2% blended rate masked a systemic handoff problem.
Failure mode: Blended error rates hide that cross-model handoffs are the primary failure point. Resources are allocated to improving individual agents when the real problem is the integration layer.
Scope: All cross-model workflows.
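Segmenting the metric only requires tagging each task with the set of models involved. A sketch of the computation (the task record shape is an assumption):

```python
def error_rates(tasks: list[dict]) -> dict[str, float]:
    """tasks: each record is {"models": set of model names, "failed": bool}.
    Returns blended, single-model, and cross-model error rates so the
    handoff problem can't hide inside the blended number."""
    def rate(subset):
        return sum(t["failed"] for t in subset) / len(subset) if subset else 0.0
    single = [t for t in tasks if len(t["models"]) == 1]
    cross = [t for t in tasks if len(t["models"]) > 1]
    return {
        "blended": rate(tasks),
        "single_model": rate(single),
        "cross_model": rate(cross),
    }
```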
failure patterns
When GPT generates content that fails fact-checking, log the failure type (fabricated claim, wrong client data, prohibited language, tone mismatch) and review monthly for patterns.
Why: After 3 months of logging, we found that 62% of GPT fact-check failures were fabricated social proof -- testimonials, case study numbers, and "as seen in" claims that didn't exist. Armed with this pattern, we added a pre-generation instruction to GPT: "Do not generate testimonials, case study results, or media mentions unless they appear verbatim in the client fact sheet." Fabricated social proof failures dropped 84% the next month.
Failure mode: Without categorized failure logging, the same error types recur. Generic "be more accurate" prompting doesn't target the specific failure mode.
Scope: All creative generation agents.
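Categorized logging of this kind can be a small append-only log with a closed taxonomy and a monthly rollup. A sketch using the four failure types named above:

```python
from collections import Counter

FAILURE_TYPES = {"fabricated_claim", "wrong_client_data",
                 "prohibited_language", "tone_mismatch"}

class FailureLog:
    def __init__(self):
        self.entries: list[tuple[str, str]] = []  # (failure_type, detail)

    def record(self, failure_type: str, detail: str) -> None:
        """Reject unknown categories so the taxonomy stays reviewable."""
        if failure_type not in FAILURE_TYPES:
            raise ValueError(f"unknown failure type: {failure_type}")
        self.entries.append((failure_type, detail))

    def monthly_pattern(self) -> list[tuple[str, int]]:
        """Most common failure types first, for the monthly review."""
        return Counter(t for t, _ in self.entries).most_common()
```

The 62%-fabricated-social-proof finding falls straight out of `monthly_pattern()` -- the dominant category tells you what to target in the pre-generation instruction.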
When a cross-model handoff fails, the receiving model must reject the input and report the schema violation. It must never improvise with missing fields.
Why: The creative brief schema requires a "tone" field (professional, casual, urgent, educational). When a brief arrived without the tone field due to a schema version mismatch, GPT defaulted to "casual" -- its training default. The client was a law firm. The generated ad copy opened with "Hey there! Need a lawyer?" The account manager caught it, but the failure revealed that missing fields trigger model defaults rather than errors.
Failure mode: Missing schema fields are silently filled by model defaults. Defaults reflect training distribution, not client requirements. Casual tone is GPT's most common training context.
Scope: All cross-model handoff points.
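The receiving side can be sketched as a validator that fails loudly on missing fields instead of letting the model fall back to a training default. The `tone` values come from the rule above; the other required fields are illustrative:

```python
ALLOWED_TONES = {"professional", "casual", "urgent", "educational"}
REQUIRED_BRIEF_FIELDS = {"client_id", "tone", "offer", "audience"}

def accept_brief(brief: dict) -> dict:
    """Reject and report a schema violation; never improvise.
    A missing 'tone' raises here instead of silently becoming
    the model's training default (casual)."""
    missing = REQUIRED_BRIEF_FIELDS - set(brief)
    if missing:
        raise KeyError(f"brief rejected, missing fields: {sorted(missing)}")
    if brief["tone"] not in ALLOWED_TONES:
        raise ValueError(f"tone must be one of {sorted(ALLOWED_TONES)}")
    return brief
```

Under this gate, the law-firm brief with the version-mismatch-dropped tone field would have produced an error report, not "Hey there! Need a lawyer?"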
API failures on one platform (Meta or Google) must not block reporting on the other platform. Each platform's monitoring runs independently.
Why: An early architecture decision chained Meta and Google monitoring sequentially. When Meta's API went down for 4 hours on a Tuesday morning, Google Ads monitoring was also blocked because it waited for Meta to complete. We missed a Google Ads account that had exhausted its daily budget by 9 AM due to a bidding error. Cost: $1,100 in wasted spend before the media buyer checked manually at noon.
Failure mode: Sequential dependencies between independent data sources create cascading failures. One platform's outage blinds monitoring on unrelated platforms.
Scope: Multi-platform monitoring agents.
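Independence here means each platform's check runs inside its own failure boundary, so one API outage is recorded rather than propagated. A minimal sketch (the monitor callables are stand-ins for real Meta/Google API checks):

```python
def run_monitors(monitors: dict) -> dict:
    """Run each platform monitor independently; one platform's
    failure never blocks the others."""
    results = {}
    for platform, check in monitors.items():
        try:
            results[platform] = {"ok": True, "data": check()}
        except Exception as exc:
            results[platform] = {"ok": False, "error": str(exc)}
    return results
```

The same isolation holds if the monitors are dispatched concurrently (e.g. via `concurrent.futures`); the essential change from the failed architecture is that Google's check no longer waits on Meta's completion.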
human ai boundary conditions
The creative director has veto power over any GPT-generated ad copy, regardless of account manager approval. No creative ships without CD review for accounts over $10K/mo.
Why: Account managers optimize for speed and client satisfaction. The creative director optimizes for brand integrity and long-term positioning. When AMs could approve and ship creative without CD review, three clients ended up with nearly identical ad copy structures (GPT's favored patterns). A client noticed their competitor (also our client) had similar-sounding ads. The creative director now reviews all high-spend accounts.
Failure mode: Without creative authority oversight, generative AI produces converging output across clients. Clients sharing a market notice the similarity. Agency credibility suffers.
Scope: All creative generation and approval workflows.
No agent may create, modify, or delete HubSpot deal records. Agents read HubSpot. Humans write to HubSpot.
Why: Founding rule based on the VP of Media's experience at a previous agency where an automation moved 8 deals to "Closed Lost" based on a date-based rule that didn't account for extended negotiation timelines. Two of those deals were actively in conversation. The sales rep discovered the status change 3 days later and had to re-open them manually. One prospect noticed the "Closed" status in a shared HubSpot view and asked if they should look elsewhere.
Failure mode: Automated CRM writes based on rules or heuristics override active human judgment. Prospects and clients see status changes that don't reflect reality.
Scope: All agents with CRM access.
core operating rules
Model version changes (GPT-4 to GPT-4o, Claude 3 to Claude 3.5) require a 1-week shadow comparison before full cutover. Output from both versions runs in parallel and is compared.
Why: When we upgraded from GPT-4 to GPT-4o for creative generation, the output style shifted noticeably -- shorter sentences, more emoji suggestions, different headline structures. Three clients commented on the "new tone" in the first week. We hadn't noticed because the output was still high quality -- just different. The change was invisible to us but visible to clients who'd been seeing consistent messaging for months.
Failure mode: Model upgrades change output characteristics in ways that are invisible to operators but visible to end recipients. Consistency is a feature that model upgrades can silently break.
Scope: All agents during model version transitions.
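The shadow week reduces to running both versions on the same inputs and summarizing divergence. A sketch under stated assumptions -- the models are plain callables and `metric` is any output-similarity function in [0, 1]; the real comparison would likely include human spot checks of tone:

```python
def shadow_compare(inputs, old_model, new_model, metric) -> dict:
    """Run both model versions on identical inputs during the shadow
    week and summarize output drift before cutover."""
    scores = [metric(old_model(x), new_model(x)) for x in inputs]
    return {
        "n": len(scores),
        "mean_similarity": sum(scores) / len(scores),
        "min_similarity": min(scores),
    }
```

A low `min_similarity` flags the specific briefs where the new version's style shifted most -- exactly the shorter-sentence, more-emoji drift that clients noticed.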
operational heuristics
Every agent must include its model name and version in the metadata of its shared state file output.
Why: When debugging an analysis discrepancy, we couldn't tell which model version produced a particular output. The ad monitor had been running on Claude 3 Opus while the pacing agent had been upgraded to Claude 3.5 Sonnet. Their outputs used different rounding conventions, making numbers mismatch by $1-3 per metric. Model version in metadata would have identified the discrepancy source in minutes instead of the 2 hours it took.
Failure mode: Without model version tracking, debugging cross-agent discrepancies requires testing each agent individually. Root cause identification is slow when version differences aren't visible.
Scope: All agents.
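A simple way to satisfy this rule is to wrap every shared-state write in an envelope that stamps model provenance. The envelope shape is illustrative:

```python
import json
from datetime import datetime, timezone

def write_state(payload: dict, model_name: str, model_version: str) -> str:
    """Wrap shared-state output with model provenance, so cross-agent
    discrepancies (e.g. different rounding conventions between model
    versions) can be traced in minutes."""
    envelope = {
        "meta": {
            "model": model_name,        # e.g. "claude-3-5-sonnet" (illustrative)
            "version": model_version,
            "written_at": datetime.now(timezone.utc).isoformat(),
        },
        "data": payload,
    }
    return json.dumps(envelope)
```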
human ai boundary conditions
Any communication to a client that includes performance data must be reviewed by the media buyer who manages that account, not just the account manager.
Why: An account manager sent a GPT-drafted email that cited "your CPA dropped 15% this month." The media buyer managing that account knew the CPA drop was because they'd shifted budget from prospecting to retargeting -- the CPA looked better but new customer acquisition had actually declined. The client replied asking to scale the "winning strategy," which would have meant cutting all prospecting. The media buyer intervened and reframed the conversation.
Failure mode: Accurate data presented without strategic context creates false narratives. Clients make budget decisions based on metrics that look good in isolation but mask underlying trade-offs.
Scope: All client-facing communication agents.