Agent teams generate useful output on the first try. But "useful" and "excellent" are different things. The first run reveals what the team can do. Iteration reveals what it should do.
The problem is that most people either accept the first output as-is or tweak prompts at random, hoping something improves. Neither approach is efficient.
This framework gives you a structured path from first run to polished output in exactly three iterations. Three runs is the sweet spot: enough to reach high quality, few enough to avoid diminishing returns.
Before you iterate, you need to know what to evaluate. Score every agent team output on these four dimensions:
Rate each dimension as strong, adequate, or weak. The weakest dimension is your target for Run 2.
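The rating-and-targeting rule above can be sketched in a few lines of Python. This is only an illustration: the helper name `weakest_dimension`, the numeric ranking of the three labels, and the ratings in the usage example are assumptions, and the function accepts whatever dimension names your rubric uses.

```python
# Hypothetical ranking of the three rating labels so they can be compared.
RATING_ORDER = {"weak": 0, "adequate": 1, "strong": 2}


def weakest_dimension(scores: dict[str, str]) -> str:
    """Return the dimension with the lowest rating.

    Ties go to the dimension that was listed first.
    """
    for name, rating in scores.items():
        if rating not in RATING_ORDER:
            raise ValueError(f"unknown rating {rating!r} for {name!r}")
    return min(scores, key=lambda name: RATING_ORDER[scores[name]])


# Made-up ratings, for illustration only.
baseline = {"breadth": "strong", "depth": "adequate", "actionability": "weak"}
print(weakest_dimension(baseline))  # actionability
```

Keeping the ratings in a plain dictionary makes it easy to store one per run and compare them later.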
Run 1 is the baseline. Run the agent team exactly as configured and don't modify anything.
Read the full output and score each dimension. Be specific about what's weak and why.
Example baseline evaluation for a competitive analysis team:
Weakest dimension: Actionability
The baseline tells you what the team naturally produces well and where it falls short. Most teams are strong on breadth and weak on either depth or actionability. That's because default prompts tell agents what to analyze but rarely specify how specific and actionable the output should be.
Run 2 is the targeted fix. Modify the prompts to directly address the weakest dimension identified in Run 1. Change only what's needed to fix the weak area; don't overhaul everything.
Weakest dimension: Actionability
Before (baseline prompt for Strategy Synthesizer):
Synthesize the competitive analysis from all agents into a strategic recommendation. Identify key themes and suggest how we should respond.
After (targeted fix):
Synthesize the competitive analysis from all agents into a strategic recommendation. For each recommendation, specify: (1) the exact action to take, (2) which team owns it, (3) the expected timeline, and (4) how to measure success. Prioritize recommendations by impact. Generic advice like "differentiate" or "innovate" is not acceptable — every recommendation must be specific enough that a team could start executing it tomorrow.
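One way to keep the change targeted is to store prompts per agent and replace exactly one entry. The sketch below assumes prompts live in a plain dictionary keyed by agent name. The `apply_targeted_fix` helper and the "Pricing Analyst" agent name are hypothetical; the prompt texts are the ones quoted in this article.

```python
# Hypothetical prompt store: agent name -> prompt text.
baseline_prompts = {
    "Pricing Analyst": "Analyze each competitor's pricing strategy.",
    "Strategy Synthesizer": (
        "Synthesize the competitive analysis from all agents into a "
        "strategic recommendation. Identify key themes and suggest how "
        "we should respond."
    ),
}

TARGETED_FIX = (
    "Synthesize the competitive analysis from all agents into a strategic "
    "recommendation. For each recommendation, specify: (1) the exact action "
    "to take, (2) which team owns it, (3) the expected timeline, and (4) how "
    "to measure success. Prioritize recommendations by impact. Generic advice "
    'like "differentiate" or "innovate" is not acceptable; every '
    "recommendation must be specific enough that a team could start "
    "executing it tomorrow."
)


def apply_targeted_fix(prompts: dict[str, str], agent: str, new_prompt: str) -> dict[str, str]:
    """Return a copy of prompts with exactly one agent's prompt replaced."""
    if agent not in prompts:
        raise KeyError(f"no such agent: {agent}")
    updated = dict(prompts)  # copy, so the baseline stays intact for comparison
    updated[agent] = new_prompt
    return updated


run2_prompts = apply_targeted_fix(baseline_prompts, "Strategy Synthesizer", TARGETED_FIX)
changed = [a for a in run2_prompts if run2_prompts[a] != baseline_prompts[a]]
print(changed)  # ['Strategy Synthesizer']
```

Because the baseline dictionary is never mutated, you can always diff a later run against it to confirm that only the intended prompt changed.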
Re-run the team with the modified prompts. Score all four dimensions again and compare to baseline.
Example Run 2 evaluation:
Targeted fixes usually produce dramatic improvement in the weak dimension without degrading others. If fixing one dimension weakens another (e.g., pushing for actionability reduces depth), note it — you'll address it in Run 3.
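The before-and-after comparison, including the regression check, might look like the following. This is a sketch under assumptions: `compare_runs` is a hypothetical helper, the label ranking is a convention chosen for the example, and the ratings are made up.

```python
# Hypothetical ranking of the three rating labels.
RATING_ORDER = {"weak": 0, "adequate": 1, "strong": 2}


def compare_runs(before: dict[str, str], after: dict[str, str]) -> dict[str, str]:
    """Classify each dimension as improved, unchanged, or regressed."""
    report = {}
    for name in before:
        delta = RATING_ORDER[after[name]] - RATING_ORDER[before[name]]
        if delta > 0:
            report[name] = "improved"
        elif delta < 0:
            report[name] = "regressed"
        else:
            report[name] = "unchanged"
    return report


# Made-up ratings: actionability was fixed, but depth slipped.
run1 = {"breadth": "strong", "depth": "strong", "actionability": "weak"}
run2 = {"breadth": "strong", "depth": "adequate", "actionability": "strong"}

report = compare_runs(run1, run2)
regressions = [dim for dim, status in report.items() if status == "regressed"]
print(regressions)  # ['depth']
```

Anything that lands in `regressions` goes on the list for Run 3.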
Run 3 is the polish run. Fine-tune the output format, add quality criteria, and optimize the synthesis prompt. This is where you go from "strong" to "excellent."
Three things to adjust in the polish run:
First, specify the output format. Tell agents exactly how to structure their output. Tables, bullet points, headers, and specific section requirements eliminate ambiguity.
Before:
Analyze each competitor's pricing strategy.
After:
For each competitor, produce a pricing analysis in this format:
- Pricing model: (per-seat, usage-based, flat-rate, hybrid)
- Entry price: (lowest published tier)
- Enterprise price: (highest tier or custom pricing indicators)
- Key differentiator: (what makes their pricing strategy distinctive)
- Vulnerability: (where their pricing creates an opening for us)
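A fixed format also makes the output easy to check mechanically. The sketch below assumes each field appears on its own line as `Label: value`, optionally preceded by a bullet marker. The `missing_fields` helper is hypothetical and the sample values are placeholders.

```python
# Field labels taken from the format specification above.
REQUIRED_FIELDS = [
    "Pricing model",
    "Entry price",
    "Enterprise price",
    "Key differentiator",
    "Vulnerability",
]


def missing_fields(output: str) -> list[str]:
    """Return required field labels that do not start any line of the output."""
    present = set()
    for line in output.splitlines():
        # Drop leading bullet markers and spaces before matching the label.
        stripped = line.lstrip("-* ").strip().lower()
        for field in REQUIRED_FIELDS:
            if stripped.startswith(field.lower() + ":"):
                present.add(field)
    return [field for field in REQUIRED_FIELDS if field not in present]


# Placeholder agent output that skips two of the five fields.
sample = """\
- Pricing model: per-seat
- Entry price: (placeholder)
- Key differentiator: (placeholder)
"""
print(missing_fields(sample))  # ['Enterprise price', 'Vulnerability']
```

An empty list means the agent followed the structure; it says nothing about whether the content is any good, which is what the quality criteria are for.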
Second, add quality criteria: explicit standards that agents must meet. This prevents regression on dimensions you've already fixed.
Example addition to any agent prompt:
Quality standards: Every claim must include specific evidence. Every recommendation must include a concrete next step. Do not use vague quantifiers like "significant" or "many" — use numbers or ranges.
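Part of this standard can be spot-checked automatically. The following sketch flags the two vague quantifiers named in the prompt above using a whole-word, case-insensitive match. The `find_vague_quantifiers` helper is hypothetical, and the term list is an assumption you would extend to match your own standards.

```python
import re


def find_vague_quantifiers(text: str, terms=("significant", "many")) -> list[str]:
    """Return each flagged term found in text, lowercased, in order of appearance.

    Only whole words match, so "significantly" is not flagged by "significant".
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return [match.group(0).lower() for match in pattern.finditer(text)]


# Made-up sentence for illustration.
draft = "Many customers saw a significant drop; 12 of 40 churned."
print(find_vague_quantifiers(draft))  # ['many', 'significant']
```

A check like this catches wording only. Whether a claim is backed by evidence still needs a human read.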
Third, optimize the synthesis prompt. It has the biggest impact on final output quality, so for Run 3, make it comprehensive.
Before:
Combine the agent outputs into a final report.
After:
Produce the final competitive analysis report. Structure: Executive Summary (5 bullet points max), Competitor Profiles (one section per competitor using the standardized format), Strategic Recommendations (top 5, prioritized, each with specific action/owner/timeline/metric), and Open Questions (what we need to investigate further). Before writing, identify any contradictions between agent outputs and resolve them explicitly. The report should be usable in a leadership meeting without additional context.
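The structural requirements in that prompt can be verified after the run. This sketch assumes the report puts each section title on its own line (optionally prefixed with `#`) and uses `-` or `*` for bullets. `check_report` is a hypothetical helper, not part of any agent framework.

```python
# Section names and bullet limit taken from the synthesis prompt above.
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Competitor Profiles",
    "Strategic Recommendations",
    "Open Questions",
]
MAX_SUMMARY_BULLETS = 5


def check_report(report: str) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    lines = [line.strip() for line in report.splitlines()]
    positions = {}
    for index, line in enumerate(lines):
        title = line.lstrip("# ").strip()
        if title in REQUIRED_SECTIONS and title not in positions:
            positions[title] = index

    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in positions]

    found = [positions[s] for s in REQUIRED_SECTIONS if s in positions]
    if found != sorted(found):
        problems.append("sections out of order")

    if "Executive Summary" in positions:
        start = positions["Executive Summary"] + 1
        later = [p for p in positions.values() if p >= start]
        end = min(later) if later else len(lines)
        bullets = [l for l in lines[start:end] if l.startswith(("-", "*"))]
        if len(bullets) > MAX_SUMMARY_BULLETS:
            problems.append(
                f"executive summary has {len(bullets)} bullets (max {MAX_SUMMARY_BULLETS})"
            )
    return problems


# Made-up report skeleton: six summary bullets and no Open Questions section.
skeleton = "\n".join(
    ["# Executive Summary"] + ["- point"] * 6
    + ["# Competitor Profiles", "# Strategic Recommendations"]
)
print(check_report(skeleton))
```

For the skeleton above, the function reports the missing Open Questions section and the oversized executive summary.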
Score all four dimensions one final time. Compare against both Run 1 (baseline) and Run 2 (targeted fix).
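The improvement curve for prompt iteration follows a predictable pattern: the targeted fix in Run 2 delivers the largest gain, and the polish in Run 3 adds a smaller one (see the summary table below).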
After three runs, you have a well-tuned team configuration that you can reuse. Save those prompts. The next time you need a competitive analysis (or whatever the team does), you start from Run 3 quality, not Run 1.
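Saving the prompts can be as simple as writing them to a JSON file. This is a minimal sketch: the helper names, the file name, and the file layout are all assumptions, and the example writes to a temporary directory so it leaves nothing behind.

```python
import json
import tempfile
from pathlib import Path


def save_team_config(path, prompts: dict[str, str], run_label: str = "run-3") -> None:
    """Write the tuned prompts to a JSON file so they can be reused later."""
    config = {"run": run_label, "prompts": prompts}
    Path(path).write_text(json.dumps(config, indent=2), encoding="utf-8")


def load_team_config(path) -> dict[str, str]:
    """Read back the prompts saved by save_team_config."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["prompts"]


# Hypothetical file name; the prompt text here is a stand-in for your Run 3 prompt.
with tempfile.TemporaryDirectory() as folder:
    config_path = Path(folder) / "competitive_analysis_run3.json"
    save_team_config(config_path, {"Strategy Synthesizer": "tuned prompt text"})
    restored = load_team_config(config_path)

print(restored["Strategy Synthesizer"])  # tuned prompt text
```

Storing the run label next to the prompts makes it clear which iteration a saved configuration came from.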
| Run | Focus | Time | Expected Improvement |
|---|---|---|---|
| 1 | Baseline — run as-is, evaluate | 5 min | Establishes starting point |
| 2 | Targeted fix — address weakest dimension | 10 min | 30-50% quality improvement |
| 3 | Polish — format, quality criteria, synthesis | 15 min | 15-25% additional improvement |
Total time: 30 minutes to go from a default team to a polished, reusable configuration.
That's a small investment for a team you might run dozens of times. And every run after the third benefits from the prompt improvements you've already made.