Skipping Rock AI

Watch it bounce

Test-Driven Termination: Making AI Agents Stop When They're Done

Idea and structure: 100% human
AI-written: 85%
Code: from actual production


The Setup

I have a harness that runs LLM agents through a generate → evaluate → fix loop. The job: port 23 microservices from one platform architecture to another. Each service has acceptance tests already written. The harness runs unattended overnight on a budget cap.

The loop looks obvious:
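A minimal sketch of such a naive loop, assuming hypothetical `generate`, `evaluate`, and `fix` callables (stand-ins for illustration, not the harness's real functions):

```python
from typing import Callable


def naive_loop(
    generate: Callable[[], str],
    evaluate: Callable[[str], list[str]],   # returns names of failing tests
    fix: Callable[[str, list[str]], str],
    max_attempts: int = 5,
) -> str:
    """Generate once, then fix-and-retest until green or out of attempts."""
    code = generate()              # attempt 1
    failures = evaluate(code)
    attempt = 1
    while failures and attempt < max_attempts:
        # Built-in assumption: every fix is an improvement on the last one.
        code = fix(code, failures)
        failures = evaluate(code)
        attempt += 1
    return code                    # whatever the LAST attempt produced
```

Note what the last line returns: the most recent attempt, with no memory of whether an earlier one was better.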

The assumption: each retry should be strictly better. The model gets more information (specific failures), so it converges.

It doesn't.


The Problem: Retries Are a Random Walk

What actually happens: the LLM reads failing test output, identifies the broken behavior, rewrites code to fix it. But in rewriting, it refactors adjacent code. A method signature changes. An import moves. A helper gets inlined. Tests that were passing — tests that depended on code that just got refactored — now break.

On overnight runs, I watched pass counts oscillate:

Attempt 1: 8/17 passed
Attempt 2: 12/17 passed
Attempt 3: 10/17 passed  ← regression
Attempt 4: 13/17 passed
Attempt 5: 11/17 passed  ← regression

The model isn't monotonically improving. It's a random walk with a slight upward bias. Run out of retries on a down-swing and you ship worse code than you had two attempts ago.

This isn't an edge case. Without the fix described below, ~30% of tasks shipped code worse than their attempt-2 output.


BestTracker

The fix is a high-water mark. Track the best result across all attempts. If a new attempt is worse, revert.

class BestTracker:
    """Track best attempt across retries. Never go backwards."""

    def __init__(self):
        self.best_code: str | None = None
        self.best_passed: int = -1
        self.best_failed: float = float('inf')  # sentinel: nothing recorded yet
        self.best_result: EvalResult | None = None

    def update(self, code: str, result: EvalResult) -> bool:
        """Record a new attempt. Returns True if this is the new best."""
        passed = result.total - result.failed - result.errors
        if passed > self.best_passed or (
            passed == self.best_passed and result.failed < self.best_failed
        ):
            self.best_code = code
            self.best_passed = passed
            self.best_failed = result.failed
            self.best_result = result
            return True
        return False

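Comparison logic: more passes wins. Tiebreaker: fewer failures. One caveat: if you consider errors worse than failures (the test couldn't even run), note that at equal pass counts and a fixed test total, fewer failures means more errors, so comparing result.errors would be the stricter tiebreak.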

The important part is what happens after a regression:

current_code = output_file.read_text()
passed_count = result.total - result.failed - result.errors
is_better = tracker.update(current_code, result)

if not is_better and attempt > 1:
    # tracker still holds the previous best, so this logs best→current
    log.warning("Attempt %d regressed (%d→%d passed). Reverting to best.",
                attempt, tracker.best_passed, passed_count)
    output_file.write_text(tracker.best_code)

The best version goes back to disk before the next fix attempt. The next fix prompt shows the LLM the best code ever produced — not the regressed version. The fix starts from a known-good foundation.

This is a ratchet. The system only moves forward.

After all retries exhaust:

if tracker.best_code and output_file.exists():
    output_file.write_text(tracker.best_code)
    log.info("Final: best version had %d/%d passed",
             tracker.best_passed,
             tracker.best_result.total if tracker.best_result else 0)

Last attempt regressed? Doesn't matter. You always ship the high-water mark.
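To see the ratchet on the oscillating pass counts from the overnight run, here is a stripped-down replay. `HighWaterMark` is a hypothetical simplification of BestTracker that tracks only the pass count, not the code or the tiebreaker:

```python
class HighWaterMark:
    """Simplified stand-in for BestTracker: keeps only the best pass count."""

    def __init__(self) -> None:
        self.best_passed = -1
        self.best_attempt = 0

    def update(self, attempt: int, passed: int) -> bool:
        """Return True if this attempt beats the best so far."""
        if passed > self.best_passed:
            self.best_passed = passed
            self.best_attempt = attempt
            return True
        return False


def replay(pass_counts: list[int]) -> list[int]:
    """Return the best-so-far pass count after each attempt."""
    mark = HighWaterMark()
    history = []
    for attempt, passed in enumerate(pass_counts, start=1):
        mark.update(attempt, passed)
        history.append(mark.best_passed)
    return history


# Pass counts from the overnight run (out of 17).
print(replay([8, 12, 10, 13, 11]))  # [8, 12, 12, 13, 13]
```

The raw sequence ends on 11; the ratcheted sequence ends on 13. That difference is the whole point of the tracker.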


Focused Fix Prompts

My first fix implementation dumped the entire pytest output into the prompt. Predictable failure: the LLM tried to fix everything at once, touched too much code, caused cascading regressions.

The fix: extract only the failed test names, filter the pytest output to relevant sections:

def _extract_failed_tests(pytest_output: str) -> list[str]:
    """Extract names of failed/errored tests from pytest output."""
    failed = []
    for line in pytest_output.splitlines():
        m = re.match(r"(?:FAILED|ERROR)\s+\S+::(\S+)", line)
        if m:
            failed.append(m.group(1))
    return failed
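For the filtering half, one plausible approach is to drop everything before pytest's FAILURES or ERRORS section, which removes the session header and the per-test progress lines. This is a sketch, not the harness's actual filter: `filter_pytest_output` is a hypothetical name, and it assumes the default pytest report layout with `=== FAILURES ===` and `=== ERRORS ===` section banners.

```python
import re

# Section banners pytest prints before tracebacks, e.g. "===== FAILURES ====="
_SECTION_RE = re.compile(r"^=+ (?:FAILURES|ERRORS) =+$")


def filter_pytest_output(pytest_output: str) -> str:
    """Keep only the traceback sections and the summary that follows them."""
    lines = pytest_output.splitlines()
    for i, line in enumerate(lines):
        if _SECTION_RE.match(line.strip()):
            return "\n".join(lines[i:])
    # No failure section found (e.g. a crash before collection): keep it all
    # rather than hand the model an empty prompt section.
    return pytest_output
```

Falling back to the full output when no banner is found is a deliberate choice here: an unfiltered prompt is wasteful, but an empty one gives the model nothing to fix.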

The fix prompt itself:

Fix the following test failures in this flux-v4 service code.

## pytest output (these tests must pass)
{filtered_output}

## Current service code (fix this file)
{code}

## Acceptance tests (read-only, do NOT modify these)
{test_code}

Rules:
- Fix the service code so ALL tests pass
- Do NOT change the test file, only fix the service implementation
- Output the complete fixed service file

"Do NOT change the test file" is load-bearing. Without it, LLMs occasionally rewrite the test to match their broken implementation. The tests are ground truth. Code adapts to them, never the reverse.


The Dual Pipeline

First attempt gets the full treatment. Retries skip the expensive parts.

First attempt (write → review → rewrite):

if task["type"] == "service_port":
    review_prompt = _build_review_prompt(code)
    review_response, review_cost = call_llm(router.REVIEW_TIER, review_prompt)
    review_text = review_response.strip()

    if review_text.upper() != "CLEAN":
        rewrite_prompt = _build_rewrite_prompt(task, code, review_text)
        rewrite_response, rewrite_cost = call_llm(tier, rewrite_prompt)
        code = _extract_code(rewrite_response)

Attempts 2–5 (fix only): Extract failures → build focused prompt → call LLM → extract code. No review.

Why: review catches structural violations (wrong imports, missing factory function) — first-attempt problems. By attempt 2, structure is established. Remaining failures are logic bugs, off-by-ones, wrong method signatures. Review won't catch those and just burns budget.


Model Routing

_ROUTES: dict[str, tuple[str, str, int]] = {
    "cheap":    ("gemini",    "gemini-2.5-flash",             16384),
    "mid":      ("anthropic", "claude-sonnet-4-5-20250929",   8192),
    "frontier": ("anthropic", "claude-opus-4-6-20250616",     16384),
}

REVIEW_TIER = "cheap"

Cost estimated at call time from token counts:

def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    rates = _COST_PER_M.get(provider, _COST_PER_M["gemini"])
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

Every call logged with cost. Running total per task. Guardrails check before every execution.
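A sketch of how the per-task running total might be kept. The `CostLedger` class is hypothetical, and the rate table below uses placeholder numbers for illustration, not real provider prices:

```python
from collections import defaultdict

# Placeholder USD rates per million tokens. NOT real prices.
_COST_PER_M = {
    "gemini":    {"input": 1.0, "output": 2.0},
    "anthropic": {"input": 4.0, "output": 8.0},
}


def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    rates = _COST_PER_M.get(provider, _COST_PER_M["gemini"])
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


class CostLedger:
    """Accumulates estimated spend per task and overall."""

    def __init__(self) -> None:
        self.per_task: dict[str, float] = defaultdict(float)

    def record(self, task_id: str, provider: str,
               input_tokens: int, output_tokens: int) -> float:
        """Add one call's estimated cost to the task's total and return it."""
        cost = estimate_cost(provider, input_tokens, output_tokens)
        self.per_task[task_id] += cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.per_task.values())
```

One thing to watch with the fallback in `estimate_cost`: an unknown provider is silently priced at the cheapest tier's rates, so a typo in a provider name under-reports spend instead of failing loudly.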


Budget and Guardrails

The harness runs without a human. Hard constraints, not suggestions:

BUDGET_CAP_WORKDAY = 10.0      # USD
BUDGET_CAP_OVERNIGHT = 30.0    # USD

FORBIDDEN_GIT_OPS = [
    "push --force", "push -f", "reset --hard", "clean -f", "branch -D",
]

FORBIDDEN_DEPLOY = [
    "gcloud functions deploy", "gcloud run deploy", "gcloud app deploy",
]

Budget cap triggers immediate stop. Not "after this task" — now. You do not want to discover Monday morning that an overnight run burned $200 retrying an impossible task.

Forbidden operations aren't LLM instructions. They're harness-level checks the LLM never sees. It can't talk its way around them. GuardrailError kills the task immediately.

All file writes must be under /opt/flux-v4. No prod deploys. No destructive git. No force pushes.
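A minimal sketch of what such harness-level checks could look like. Only the constants and the GuardrailError name come from the harness; the three `check_*` functions are hypothetical, and the path check is lexical only (it collapses `..` but does not resolve symlinks):

```python
import posixpath
from pathlib import PurePosixPath

ALLOWED_ROOT = PurePosixPath("/opt/flux-v4")
FORBIDDEN_GIT_OPS = ["push --force", "push -f", "reset --hard", "clean -f", "branch -D"]
FORBIDDEN_DEPLOY = ["gcloud functions deploy", "gcloud run deploy", "gcloud app deploy"]


class GuardrailError(Exception):
    """Raised when the harness refuses an operation."""


def check_command(command: str) -> None:
    """Reject any shell command containing a forbidden operation."""
    for op in FORBIDDEN_GIT_OPS + FORBIDDEN_DEPLOY:
        if op in command:
            raise GuardrailError(f"forbidden operation: {op!r}")


def check_write_path(path: str) -> None:
    """Reject writes that land outside ALLOWED_ROOT after normalization."""
    normalized = PurePosixPath(posixpath.normpath(path))
    if not normalized.is_relative_to(ALLOWED_ROOT):
        raise GuardrailError(f"write outside {ALLOWED_ROOT}: {path}")


def check_budget(spent_usd: float, cap_usd: float) -> None:
    """Stop as soon as estimated spend reaches the cap."""
    if spent_usd >= cap_usd:
        raise GuardrailError(f"budget cap reached: ${spent_usd:.2f} of ${cap_usd:.2f}")
```

Substring matching is blunt: `push --force-with-lease` is rejected too, because it contains `push --force`. For an unattended run, erring on the side of refusing seems like the right trade.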


Code Extraction

LLMs return code in unpredictable wrappers. Sometimes markdown fences, sometimes raw text, sometimes explanatory prose with code buried in it. The extractor validates with ast.parse():

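import ast
import re

def _extract_code(response: str) -> str:
    # Prefer the first fenced block, with or without a "python" tag.
    match = re.search(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    if match:
        code = match.group(1).strip()
        ast.parse(code)  # raises SyntaxError if invalid
        return code
    # No fence: treat the whole response as code, but only if it parses.
    candidate = response.strip()
    ast.parse(candidate)  # reject if not valid Python
    return candidate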

If it doesn't parse, it's not code. Reject it. This catches the case where the model returns "Here's the fixed version:" followed by a partial implementation.


The Evaluator

The test runner itself is straightforward. The design decisions are in what it returns and how the retry loop uses it:

@dataclass
class EvalResult:
    passed: bool
    total: int
    failed: int
    errors: int
    output: str  # raw pytest output — fed back into fix prompts

def run_tests(test_path: str, timeout: int = 120) -> EvalResult:
    result = subprocess.run(
        ["python3", "-m", "pytest", str(full_path), "-v", "--tb=short", "-q"],
        capture_output=True, text=True, timeout=timeout,
        cwd=str(BASE_DIR),
    )
    output = result.stdout + result.stderr
    # Parse pytest summary: "X passed, Y failed, Z errors"
    ...
    return EvalResult(
        passed=result.returncode == 0 and failed == 0 and errors == 0,
        total=total, failed=failed, errors=errors, output=output,
    )

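120-second timeout per test run. Timeout returns as an error; since subprocess.run raises subprocess.TimeoutExpired rather than returning, the runner has to catch it to report it that way. The output field carries raw pytest text forward into fix prompts: the model needs to see actual tracebacks, not summaries.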
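The summary parsing is elided in the listing above. One way it could be done, as a sketch: `parse_summary` is a hypothetical helper, it assumes pytest's usual final line of the form "3 failed, 14 passed, 1 error in 2.31s", and it counts only passed, failed, and errored tests toward the total (skips are ignored).

```python
import re

# Matches "14 passed", "3 failed", "1 error", "2 errors" in a summary line.
_COUNT_RE = re.compile(r"(\d+) (passed|failed|errors?)\b")


def parse_summary(output: str) -> dict[str, int]:
    """Pull passed/failed/error counts from the last line that has any."""
    counts = {"passed": 0, "failed": 0, "errors": 0}
    # Scan from the end: the summary is printed last, after any tracebacks.
    for line in reversed(output.splitlines()):
        found = _COUNT_RE.findall(line)
        if found:
            for number, label in found:
                key = "errors" if label.startswith("error") else label
                counts[key] = int(number)
            break
    counts["total"] = counts["passed"] + counts["failed"] + counts["errors"]
    return counts
```

If no line matches at all (a crash before collection, say), every count comes back as zero, which the caller should treat as a failed run and not as a clean one.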


Beyond the Retry Loop: Evolution Scoring

The harness handles individual services. The broader system — the evolution engine — runs multiple agents against the same challenge and scores on three axes:

Axis                 Points  Method
Automated tests      0–63    pytest pass count
Principle adherence  0–20    LLM judge, 10 principles × 0-2 points
Scar tissue          0–9     grep-based checks for known pitfalls

Scar tissue checks are the simplest layer. Grep for hardcoded API keys. Grep for missing timeout parameters. Grep for bare except:. Each is a regex. Each catches something an LLM will do if you don't penalize it.
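A sketch of what three such checks might look like in Python. The patterns and the `find_scars` name are illustrative assumptions, not the engine's actual rules; they are single-line heuristics aimed at Python source that calls `requests`, and how hits map to the 0–9 points is left out because it isn't specified here.

```python
import re

# Illustrative patterns only; real checks would be tuned to the codebase.
SCAR_CHECKS = {
    # A key-like name assigned a quoted literal, e.g. API_KEY = "sk-123"
    "hardcoded_secret": re.compile(
        r"""(?i)(?:api[_-]?key|secret|token)\w*\s*=\s*["'][^"']+["']"""
    ),
    # A requests call whose argument list never mentions timeout=
    "missing_timeout": re.compile(
        r"requests\.(?:get|post|put|delete)\((?![^)]*timeout=)[^)]*\)"
    ),
    # "except:" with no exception type
    "bare_except": re.compile(r"^\s*except\s*:", re.MULTILINE),
}


def find_scars(source: str) -> list[str]:
    """Return the names of all checks that fire on the given source text."""
    return [name for name, pattern in SCAR_CHECKS.items() if pattern.search(source)]
```

These will miss multi-line calls and produce the odd false positive. For a penalty signal that only needs to discourage a habit, that is usually acceptable.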

The principle judge is more dangerous. Early runs produced 20/20 for code that clearly violated principles — the LLM judge was rubber-stamping. Fix: threshold gating (only judge agents above 50% on automated tests) and calibration examples (show what 0, 1, and 2 look like per principle).
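The gating half of that fix is small enough to sketch. `should_run_judge` is a hypothetical name, and treating "above 50%" as a strict inequality is an assumption:

```python
def should_run_judge(tests_passed: int, tests_total: int,
                     threshold: float = 0.5) -> bool:
    """Only spend a judge call on agents strictly above the pass-rate threshold."""
    if tests_total <= 0:
        return False  # nothing was evaluated, so there is nothing worth judging
    return tests_passed / tests_total > threshold
```

With 63 automated tests, an agent needs at least 32 passes to reach the judge under this rule.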

The most dramatic result: one agent passed 61/63 tests. Nearly perfect functionally. Scored 2/20 on principles — raw SDK calls everywhere instead of neutral-intent wrappers. Final: 68/92. The winner: 63/63 tests, 18/20 principles, total 86/92.

An 18-point gap on code that did the same thing.
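The arithmetic behind those totals, as a small worked example. The `Score` class is hypothetical, and the scar-tissue values are inferred by subtracting the stated test and principle scores from the stated totals; they are not quoted figures.

```python
from dataclasses import dataclass


@dataclass
class Score:
    tests: int        # 0-63, pytest pass count
    principles: int   # 0-20, LLM judge
    scar_tissue: int  # 0-9, grep-based checks

    @property
    def total(self) -> int:
        return self.tests + self.principles + self.scar_tissue


# scar_tissue=5 is inferred: 68 - 61 - 2 = 5 and 86 - 63 - 18 = 5.
runner_up = Score(tests=61, principles=2, scar_tissue=5)
winner = Score(tests=63, principles=18, scar_tissue=5)

print(runner_up.total, winner.total, winner.total - runner_up.total)  # 68 86 18
```

Of the 18-point gap, 16 points come from principle adherence and only 2 from tests.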

If you're scoring LLM output only on "does it work," you're measuring the wrong thing.


What Running This Taught Me

The first attempt matters more than the retry loop. A well-crafted prompt with gold standard examples, acceptance tests, and capability signatures passes 60-80% of tests on attempt 1. The retry loop is cleanup. Invest in prompt engineering over retry engineering.

Focused failures beat full dumps. Showing all 17 test results when 3 are failing causes the model to "fix" things that aren't broken. Filter. Name the specific failures. Let it scope changes.

Review is cheap insurance — on the first pass only. Flash review catches structural violations that would fail every test. $0.001 to save a wasted $0.15 Sonnet call. But on retry #3, you're past structural problems — review adds nothing.

Budget caps are non-negotiable. One stuck service with 5 retries costs $3-5. Twenty-three services. Without caps, $100+ overnight.

Language barely matters for functional scores. I ran evolution competitions across Python, Go, TypeScript, and Rust. All achieved similar test pass rates. The differentiation was in principle adherence and architectural style. The agent with the "simplest" genome ("import these three functions from the shared library") beat the agent with the most sophisticated architecture ("hexagonal, dependency injection, adapter pattern") because the LLM could follow one instruction literally and couldn't reliably construct the other from a description.

Infrastructure complexity is a disqualifier. One Go agent had the most sophisticated architecture on paper. Scored 9/92 because the Docker multi-stage build + Cloud Run deploy chain didn't complete within the timeout. Never even got evaluated. Friction kills agents the same way it kills developers — just faster.


The Point

The industry conversation about agents is about capabilities. What can they do. How smart. Which benchmark.

Nobody talks about the plumbing around them.

BestTracker is 19 lines. The router is 37. Budget guards are two constants. The code extractor is 15 lines with an ast.parse(). None of it is intellectually difficult.

All of it is necessary. None of it is obvious until you've watched an overnight run burn $30 producing code worse than what you started with.

Running LLM agents in production requires the same discipline as any production system: structured evaluation, regression prevention, budget controls, guardrails that can't be talked around, and the humility to revert when your system is making things worse.

The agents are the easy part. Making them stop is the engineering.
