Test-Driven Termination: Making AI Agents Stop When They're Done
The Setup
I have a harness that runs LLM agents through a generate → evaluate → fix loop. The job: port 23 microservices from one platform architecture to another. Each service has acceptance tests already written. The harness runs unattended overnight on a budget cap.
The loop looks obvious:
- Feed the LLM a prompt with reference code, acceptance tests, and capability signatures.
- Run pytest against the output.
- If tests pass, done. If not, feed the failures back and try again. Up to 5 attempts.
The assumption: each retry should be strictly better. The model gets more information (specific failures), so it converges.
It doesn't.
The Problem: Retries Are a Random Walk
What actually happens: the LLM reads failing test output, identifies the broken behavior, rewrites code to fix it. But in rewriting, it refactors adjacent code. A method signature changes. An import moves. A helper gets inlined. Tests that were passing — tests that depended on code that just got refactored — now break.
On overnight runs, I watched pass counts oscillate:
Attempt 1: 8/17 passed
Attempt 2: 12/17 passed
Attempt 3: 10/17 passed ← regression
Attempt 4: 13/17 passed
Attempt 5: 11/17 passed ← regression
The model isn't monotonically improving. It's a random walk with a slight upward bias. Run out of retries on a down-swing and you ship worse code than you had two attempts ago.
This isn't an edge case. Without the fix described below, ~30% of tasks shipped code worse than their attempt-2 output.
BestTracker
The fix is a high-water mark. Track the best result across all attempts. If a new attempt is worse, revert.
class BestTracker:
    """Track best attempt across retries. Never go backwards."""
    def __init__(self):
        self.best_code: str | None = None
        self.best_passed: int = -1
        self.best_errors: float = float('inf')
        self.best_result: EvalResult | None = None
    def update(self, code: str, result: EvalResult) -> bool:
        """Record a new attempt. Returns True if this is the new best."""
        passed = result.total - result.failed - result.errors
        if passed > self.best_passed or (
            passed == self.best_passed and result.errors < self.best_errors
        ):
            self.best_code = code
            self.best_passed = passed
            self.best_errors = result.errors
            self.best_result = result
            return True
        return False
Comparison logic: prioritize more passes. Tiebreaker: fewer errors, because an error is worse than a failure (the test couldn't even run). Comparing on failures instead would get this backwards: with a fixed test count and equal passes, fewer failures means more errors.
The important part is what happens after a regression:
current_code = output_file.read_text()
passed_count = result.total - result.failed - result.errors
is_better = tracker.update(current_code, result)
if not is_better and attempt > 1:
    log.warning("Attempt %d regressed (%d→%d passed). Reverting to best.",
                attempt, tracker.best_passed, passed_count)
    output_file.write_text(tracker.best_code)
The best version goes back to disk before the next fix attempt. The next fix prompt shows the LLM the best code produced so far, not the regressed version. The fix starts from the best-known foundation.
This is a ratchet. The system only moves forward.
After all retries exhaust:
if tracker.best_code and output_file.exists():
    output_file.write_text(tracker.best_code)
log.info("Final: best version had %d/%d passed",
         tracker.best_passed,
         tracker.best_result.total if tracker.best_result else 0)
Last attempt regressed? Doesn't matter. You always ship the high-water mark.
Focused Fix Prompts
My first fix implementation dumped the entire pytest output into the prompt. Predictable failure: the LLM tried to fix everything at once, touched too much code, caused cascading regressions.
The fix: extract only the failed test names, then filter the pytest output down to the sections for those tests:
import re

def _extract_failed_tests(pytest_output: str) -> list[str]:
    """Extract names of failed/errored tests from pytest output."""
    failed = []
    for line in pytest_output.splitlines():
        # Matches short-summary lines such as "FAILED path/test_x.py::test_name - ..."
        m = re.match(r"(?:FAILED|ERROR)\s+\S+::(\S+)", line)
        if m:
            failed.append(m.group(1))
    return failed
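The filtering half isn't shown above, so here is one way it could look. This is a sketch under assumptions: `filter_output` is a hypothetical name, and it relies on pytest's usual layout, where each failure section opens with an underscore-delimited header naming the test and top-level banners are delimited by `=` runs.

```python
import re

SECTION_HEADER = re.compile(r"^_{3,} (.+?) _{3,}$")  # e.g. "____ test_bad ____"
BANNER = re.compile(r"^={3,} .* ={3,}$")             # e.g. "==== FAILURES ===="


def filter_output(pytest_output: str, failed_names: list[str]) -> str:
    """Keep traceback sections for the named tests plus FAILED/ERROR summary lines."""
    kept: list[str] = []
    keeping = False
    for line in pytest_output.splitlines():
        header = SECTION_HEADER.match(line)
        if header:
            # Headers may carry a prefix, e.g. "TestClass.test_x".
            keeping = any(header.group(1).endswith(name) for name in failed_names)
        elif BANNER.match(line):
            keeping = False  # a new top-level banner ends the current section
        if keeping or line.startswith(("FAILED", "ERROR")):
            kept.append(line)
    return "\n".join(kept)


sample = "\n".join([
    "tests/test_svc.py::test_ok PASSED",
    "tests/test_svc.py::test_bad FAILED",
    "=================== FAILURES ===================",
    "___________________ test_bad ___________________",
    "tests/test_svc.py:9: in test_bad",
    "E   assert 1 == 2",
    "=========== short test summary info ============",
    "FAILED tests/test_svc.py::test_bad - assert 1 == 2",
])
filtered = filter_output(sample, ["test_bad"])
print(filtered)
```

The passing test and the progress lines drop out; the traceback and the summary line for `test_bad` stay. Nested or unusual header formats would need more care than this.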
The fix prompt itself:
Fix the following test failures in this flux-v4 service code.
## pytest output (these tests must pass)
{filtered_output}
## Current service code (fix this file)
{code}
## Acceptance tests (read-only, do NOT modify these)
{test_code}
Rules:
- Fix the service code so ALL tests pass
- Do NOT change the test file, only fix the service implementation
- Output the complete fixed service file
"Do NOT change the test file" is load-bearing. Without it, LLMs occasionally rewrite the test to match their broken implementation. The tests are ground truth. Code adapts to them, never the reverse.
The Dual Pipeline
First attempt gets the full treatment. Retries skip the expensive parts.
First attempt (write → review → rewrite):
- Write (Sonnet): Generate from reference code + acceptance tests + gold standard examples.
- Review (Flash): Cheap model checks 6 structural rules, including correct imports, return types, logging patterns, and handler signatures.
- Rewrite (Sonnet): If review found violations, fix them before running tests.
if task["type"] == "service_port":
    review_prompt = _build_review_prompt(code)
    review_response, review_cost = call_llm(router.REVIEW_TIER, review_prompt)
    review_text = review_response.strip()
    if review_text.upper() != "CLEAN":
        rewrite_prompt = _build_rewrite_prompt(task, code, review_text)
        rewrite_response, rewrite_cost = call_llm(tier, rewrite_prompt)
        code = _extract_code(rewrite_response)
Attempts 2–5 (fix only): Extract failures → build focused prompt → call LLM → extract code. No review.
Why: review catches structural violations (wrong imports, missing factory function) — first-attempt problems. By attempt 2, structure is established. Remaining failures are logic bugs, off-by-ones, wrong method signatures. Review won't catch those and just burns budget.
Model Routing
_ROUTES: dict[str, tuple[str, str, int]] = {
    "cheap": ("gemini", "gemini-2.5-flash", 16384),
    "mid": ("anthropic", "claude-sonnet-4-5-20250929", 8192),
    "frontier": ("anthropic", "claude-opus-4-6-20250616", 16384),
}
REVIEW_TIER = "cheap"
- Flash ($0.15/M input, $0.60/M output): Review step. Fast and cheap enough to use speculatively — even if the code is clean, review costs fractions of a cent.
- Sonnet ($3/M input, $15/M output): Code generation and fixes. The workhorse.
- Opus ($15/M input, $75/M output): Architecture-level decisions. Not in the retry loop.
Cost estimated at call time from token counts:
def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    rates = _COST_PER_M.get(provider, _COST_PER_M["gemini"])
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
Every call logged with cost. Running total per task. Guardrails check before every execution.
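A sketch of what the per-task running total could look like. `CostLedger` and `estimate_cost_for_tier` are hypothetical names, and keying rates by tier rather than by provider is my assumption for this example: Sonnet and Opus share a provider but not a price, so a tier key keeps them apart. The rates are the ones quoted above.

```python
from dataclasses import dataclass, field

# USD per million tokens, keyed by tier (an assumption; rates as quoted above).
COST_PER_M = {
    "cheap": {"input": 0.15, "output": 0.60},
    "mid": {"input": 3.00, "output": 15.00},
    "frontier": {"input": 15.00, "output": 75.00},
}


def estimate_cost_for_tier(tier: str, input_tokens: int, output_tokens: int) -> float:
    rates = COST_PER_M[tier]  # KeyError on an unknown tier instead of a cheap default
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


@dataclass
class CostLedger:
    """Accumulates estimated spend per task."""
    per_task: dict[str, float] = field(default_factory=dict)

    def record(self, task_id: str, tier: str, input_tokens: int, output_tokens: int) -> float:
        cost = estimate_cost_for_tier(tier, input_tokens, output_tokens)
        self.per_task[task_id] = self.per_task.get(task_id, 0.0) + cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.per_task.values())


ledger = CostLedger()
ledger.record("svc-a", "mid", input_tokens=20_000, output_tokens=4_000)   # about $0.12
ledger.record("svc-a", "cheap", input_tokens=10_000, output_tokens=100)   # about $0.0016
print(f"svc-a: ${ledger.per_task['svc-a']:.4f}, run total: ${ledger.total:.4f}")
```

Failing loudly on an unknown tier is a deliberate choice in this sketch: for a budget guard, underestimating a call is worse than crashing on it.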
Budget and Guardrails
The harness runs without a human. Hard constraints, not suggestions:
BUDGET_CAP_WORKDAY = 10.0    # USD
BUDGET_CAP_OVERNIGHT = 30.0  # USD
FORBIDDEN_GIT_OPS = [
    "push --force", "push -f", "reset --hard", "clean -f", "branch -D",
]
FORBIDDEN_DEPLOY = [
    "gcloud functions deploy", "gcloud run deploy", "gcloud app deploy",
]
Budget cap triggers immediate stop. Not "after this task" — now. You do not want to discover Monday morning that an overnight run burned $200 retrying an impossible task.
Forbidden operations aren't LLM instructions. They're harness-level checks the LLM never sees. It can't talk its way around them. GuardrailError kills the task immediately.
All file writes must be under /opt/flux-v4. No prod deploys. No destructive git. No force pushes.
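Roughly what harness-level checks like these might look like. The function names `check_command`, `check_write_path`, and `check_budget` are hypothetical, and plain substring matching is an assumption of this sketch; only `GuardrailError`, the constant lists, and the `/opt/flux-v4` root come from the text above.

```python
from pathlib import Path


class GuardrailError(Exception):
    """Raised when the harness refuses an action."""


FORBIDDEN_GIT_OPS = ["push --force", "push -f", "reset --hard", "clean -f", "branch -D"]
FORBIDDEN_DEPLOY = ["gcloud functions deploy", "gcloud run deploy", "gcloud app deploy"]
ALLOWED_ROOT = Path("/opt/flux-v4").resolve()


def check_command(command: str) -> None:
    normalized = " ".join(command.split())  # collapse repeated whitespace
    for op in FORBIDDEN_GIT_OPS + FORBIDDEN_DEPLOY:
        if op in normalized:
            raise GuardrailError(f"forbidden operation: {op!r}")


def check_write_path(path: str) -> None:
    resolved = Path(path).resolve()  # normalizes ".." and follows symlinks
    if not resolved.is_relative_to(ALLOWED_ROOT):
        raise GuardrailError(f"write outside {ALLOWED_ROOT}: {resolved}")


def check_budget(spent_usd: float, cap_usd: float) -> None:
    if spent_usd >= cap_usd:
        raise GuardrailError(f"budget cap hit: ${spent_usd:.2f} of ${cap_usd:.2f}")
```

Substring matching is deliberately blunt. It also blocks `push --force-with-lease` and `clean -fd`, which is the side I'd rather err on for an unattended run.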
Code Extraction
LLMs return code in unpredictable wrappers. Sometimes markdown fences, sometimes raw text, sometimes explanatory prose with code buried in it. The extractor validates with ast.parse():
import ast
import re

def _extract_code(response: str) -> str:
    # Prefer the first fenced block; otherwise treat the whole response as code.
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    if match:
        code = match.group(1).strip()
        ast.parse(code)  # SyntaxError if invalid
        return code
    candidate = response.strip()
    ast.parse(candidate)  # reject if not valid Python
    return candidate
If it doesn't parse, it's not code. Reject it. This catches the case where the model returns "Here's the fixed version:" followed by a partial implementation.
The Evaluator
The test runner itself is straightforward. The design decisions are in what it returns and how the retry loop uses it:
import subprocess
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    total: int
    failed: int
    errors: int
    output: str  # raw pytest output, fed back into fix prompts

def run_tests(test_path: str, timeout: int = 120) -> EvalResult:
    full_path = BASE_DIR / test_path  # assumption: BASE_DIR is a pathlib.Path
    result = subprocess.run(
        ["python3", "-m", "pytest", str(full_path), "-v", "--tb=short", "-q"],
        capture_output=True, text=True, timeout=timeout,
        cwd=str(BASE_DIR),
    )
    output = result.stdout + result.stderr
    # Parse pytest summary: "X passed, Y failed, Z errors"
    ...
    return EvalResult(
        passed=result.returncode == 0 and failed == 0 and errors == 0,
        total=total, failed=failed, errors=errors, output=output,
    )
120-second timeout per test run. Timeout returns as an error. The output field carries raw pytest text forward into fix prompts — the model needs to see actual tracebacks, not summaries.
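The summary parsing and the timeout path are elided above. Here is a minimal sketch of how both could work, not the harness's actual code: `parse_summary` and `run_command` are hypothetical names, and the regexes assume pytest's usual final line of the form "2 failed, 14 passed, 1 error in 3.21s".

```python
import re
import subprocess

COUNT = re.compile(r"(\d+) (passed|failed|errors?)\b")
DURATION = re.compile(r"\bin \d+(?:\.\d+)?s\b")


def parse_summary(output: str) -> tuple[int, int, int]:
    """Return (total, failed, errors) from pytest's final summary line."""
    for line in reversed(output.splitlines()):
        if DURATION.search(line) and COUNT.search(line):
            counts = {"passed": 0, "failed": 0, "error": 0}
            for number, label in COUNT.findall(line):
                counts["error" if label.startswith("error") else label] += int(number)
            total = counts["passed"] + counts["failed"] + counts["error"]
            return total, counts["failed"], counts["error"]
    return 0, 0, 1  # no summary line at all: treat the run as one error


def run_command(cmd: list[str], timeout: float) -> tuple[int, int, int, str]:
    """Run a test command; a timeout is reported as a single error."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0, 0, 1, f"timed out after {timeout}s"
    output = proc.stdout + proc.stderr
    return (*parse_summary(output), output)


print(parse_summary("===== 2 failed, 14 passed, 1 error in 3.21s ====="))  # (17, 2, 1)
```

This version ignores skipped and xfailed tests when computing the total. Whether that is right depends on how the acceptance suites use them.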
Beyond the Retry Loop: Evolution Scoring
The harness handles individual services. The broader system — the evolution engine — runs multiple agents against the same challenge and scores on three axes:
| Axis | Points | Method |
|---|---|---|
| Automated tests | 0–63 | pytest pass count |
| Principle adherence | 0–20 | LLM judge, 10 principles × 0-2 points |
| Scar tissue | 0–9 | grep-based checks for known pitfalls |
Scar tissue checks are the simplest layer. Grep for hardcoded API keys. Grep for missing timeout parameters. Grep for bare except:. Each is a regex. Each catches something an LLM will do if you don't penalize it.
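A sketch of what a scar tissue layer can look like. The three patterns and the three-points-per-check weighting are illustrative assumptions chosen to fit the 0–9 range, not the engine's real checks, and `scar_tissue_score` is a hypothetical name.

```python
import re

# Hypothetical patterns; a real list grows out of failures you've actually seen.
SCAR_CHECKS: dict[str, re.Pattern[str]] = {
    "hardcoded_api_key": re.compile(r"""(?i)api_?key\s*=\s*["'][^"']+["']"""),
    "bare_except": re.compile(r"^\s*except\s*:", re.MULTILINE),
    "request_without_timeout": re.compile(
        r"requests\.(?:get|post)\((?![^)]*timeout)[^)]*\)"
    ),
}


def scar_tissue_score(code: str, points_per_check: int = 3) -> tuple[int, list[str]]:
    """Award points for every check the code does NOT trip; also return the hits."""
    hits = [name for name, pattern in SCAR_CHECKS.items() if pattern.search(code)]
    clean = len(SCAR_CHECKS) - len(hits)
    return clean * points_per_check, hits


bad = 'API_KEY = "sk-123"\ntry:\n    r = requests.get(url)\nexcept:\n    pass\n'
good = (
    'key = os.environ["API_KEY"]\n'
    "try:\n    r = requests.get(url, timeout=5)\nexcept ValueError:\n    pass\n"
)
print(scar_tissue_score(bad))
print(scar_tissue_score(good))
```

Regexes like these miss plenty (nested parentheses defeat the timeout check, for one). They are cheap and deterministic, and the LLM judge is neither.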
The principle judge is more dangerous. Early runs produced 20/20 for code that clearly violated principles — the LLM judge was rubber-stamping. Fix: threshold gating (only judge agents above 50% on automated tests) and calibration examples (show what 0, 1, and 2 look like per principle).
The most dramatic result: one agent passed 61/63 tests. Nearly perfect functionally. Scored 2/20 on principles — raw SDK calls everywhere instead of neutral-intent wrappers. Final: 68/92. The winner: 63/63 tests, 18/20 principles, total 86/92.
An 18-point gap on code that did the same thing.
If you're scoring LLM output only on "does it work," you're measuring the wrong thing.
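The arithmetic, with the threshold gate folded in, as a small sketch. `Score` and `score_agent` are hypothetical names. The scar tissue value of 5 for both agents is not stated above; it is what falls out by subtraction from the quoted totals (68 - 61 - 2 and 86 - 63 - 18).

```python
from dataclasses import dataclass
from typing import Callable

MAX_TESTS, MAX_PRINCIPLES, MAX_SCAR = 63, 20, 9
JUDGE_THRESHOLD = 0.5  # only judge agents above 50% on automated tests


@dataclass
class Score:
    tests: int
    principles: int
    scar: int

    @property
    def total(self) -> int:
        return self.tests + self.principles + self.scar


def score_agent(tests_passed: int, scar: int, judge: Callable[[], int]) -> Score:
    """Combine the three axes; the LLM judge only runs above the threshold."""
    gated_in = tests_passed / MAX_TESTS > JUDGE_THRESHOLD
    principles = judge() if gated_in else 0
    return Score(tests_passed, principles, scar)


# Scar tissue = 5 is inferred by subtraction from the totals in the text.
runner_up = score_agent(61, 5, judge=lambda: 2)
winner = score_agent(63, 5, judge=lambda: 18)
print(runner_up.total, winner.total, winner.total - runner_up.total)  # 68 86 18
```

Two tests of the gap are functional. The other sixteen points come entirely from the principle axis.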
What Running This Taught Me
The first attempt matters more than the retry loop. A well-crafted prompt with gold standard examples, acceptance tests, and capability signatures passes 60-80% of tests on attempt 1. The retry loop is cleanup. Invest in prompt engineering over retry engineering.
Focused failures beat full dumps. Showing all 17 test results when 3 are failing causes the model to "fix" things that aren't broken. Filter. Name the specific failures. Let it scope changes.
Review is cheap insurance — on the first pass only. Flash review catches structural violations that would fail every test. $0.001 to save a wasted $0.15 Sonnet call. But on retry #3, you're past structural problems — review adds nothing.
Budget caps are non-negotiable. One stuck service with 5 retries costs $3-5. Twenty-three services. Without caps, $100+ overnight.
Language barely matters for functional scores. I ran evolution competitions across Python, Go, TypeScript, and Rust. All achieved similar test pass rates. The differentiation was in principle adherence and architectural style. The agent with the "simplest" genome ("import these three functions from the shared library") beat the agent with the most sophisticated architecture ("hexagonal, dependency injection, adapter pattern") because the LLM could follow one instruction literally and couldn't reliably construct the other from a description.
Infrastructure complexity is a disqualifier. One Go agent had the most sophisticated architecture on paper. It scored 9/92 because the Docker multi-stage build + Cloud Run deploy chain didn't complete within the timeout, so its tests never even ran. Friction kills agents the same way it kills developers, just faster.
The Point
The industry conversation about agents is about capabilities. What can they do. How smart. Which benchmark.
Nobody talks about:
- When do you stop them?
- How do you prevent regression?
- How do you score across dimensions that matter?
- What happens when more compute makes things worse?
BestTracker is 19 lines. The router is 37. Budget guards are two constants. The code extractor is 15 lines with an ast.parse(). None of it is intellectually difficult.
All of it is necessary. None of it is obvious until you've watched an overnight run burn $30 producing code worse than what you started with.
Running LLM agents in production requires the same discipline as any production system: structured evaluation, regression prevention, budget controls, guardrails that can't be talked around, and the humility to revert when your system is making things worse.
The agents are the easy part. Making them stop is the engineering.