
Recipe: Evaluator-Gated Agent Loop

This recipe builds a self-correcting agent: a triformer generates an answer, an evaluator scores it against your criteria, and the loop refines the draft until it passes. It is test-driven development (TDD) applied to an LLM: you write the rubric first, then let the agent iterate against it.

The problem it solves

A single agent call gives you whatever the model produces on the first try. You have no quality floor, no record of why an output was good or bad, and no automatic second attempt when it falls short. This recipe puts a graded checkpoint between “generate” and “done”, so a run only completes when it clears a bar you defined.

Elements

Element      Role
triformer    The agent that drafts and revises the work.
evaluator    Scores each draft against weighted criteria and returns a pass/fail scorecard.
prompt       Holds the reusable task template (and any A/B variants).
automation   Drives the generate → evaluate → refine loop as a directed graph.
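
To make the evaluator's role concrete, here is a minimal local sketch of how weighted criteria could combine into a pass/fail scorecard. It is not the evaluator's actual implementation: the `Criterion` class, the `score_draft` helper, and the choice of a weighted mean for `overall_score` are all assumptions made for illustration. Only the field names (`overall_score`, `passed`, `feedback`, `pass_threshold`) are taken from this recipe.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """Hypothetical stand-in for one rubric entry (name, weight, threshold)."""
    name: str
    weight: float
    threshold: float  # minimum acceptable score for this criterion, 0..1


def score_draft(scores, criteria, pass_threshold):
    """Combine per-criterion scores into a scorecard.

    Assumption: the overall score is a weighted mean and the verdict
    depends only on pass_threshold. The real evaluator may differ.
    """
    total_weight = sum(c.weight for c in criteria)
    overall = sum(scores[c.name] * c.weight for c in criteria) / total_weight
    # Collect feedback for every criterion that misses its own threshold.
    feedback = [
        f"{c.name}: {scores[c.name]:.2f} is below threshold {c.threshold:.2f}"
        for c in criteria
        if scores[c.name] < c.threshold
    ]
    return {
        "overall_score": overall,
        "passed": overall >= pass_threshold,
        "feedback": feedback,
    }


rubric = [
    Criterion("accuracy", weight=3.0, threshold=0.75),
    Criterion("tone", weight=1.0, threshold=0.5),
]
card = score_draft({"accuracy": 0.5, "tone": 1.0}, rubric, pass_threshold=0.75)
print(card)
```

In this example the heavily weighted `accuracy` criterion drags the overall score under the bar, so the scorecard fails and carries one feedback line that a refine step could use.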

Flow

  1. Create a triformer. An evaluator is auto-provisioned when the agent is created, so you do not attach one by hand; you only configure its criteria.
  2. Define the rubric on the evaluator. Each criterion carries a name, a weight, and a threshold; an overall pass_threshold decides the verdict. Read it back any time with the evaluator’s criteria operation.
  3. Run the agent with generate, passing your task. You get back a generation_id.
  4. Score the result with the evaluator’s evaluate operation — pass the agent’s generation_id (or inline data.output) and your criteria. It returns an evaluation_id and a scorecard with overall_score, a passed flag, and per-criterion feedback.
  5. If passed is false, call the agent’s refine operation with the original generation_id and the scorecard’s feedback as the new prompt. refine continues the same conversation, so the agent keeps its prior tool-call history while it corrects.
  6. Wrap steps 3–5 in an automation. Automations reference triformer, evaluator, and prompt as steps, and their built-in condition and loop children let you repeat refine/evaluate until the scorecard passes or you hit a retry cap.
  7. Track quality over time with the evaluator’s results operation (returns a pass_rate) and compare (returns a trend of improving / stable / declining).
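
Steps 3 to 6 can be sketched as a plain Python loop. This is an illustration of the control flow only, not the platform's API: `run_gated_loop` and the three `fake_*` functions are hypothetical stand-ins for the generate, evaluate, and refine operations, and the sketch assumes that refine returns a generation id for the revised draft, which the recipe does not state. In a real setup, an automation's condition and loop children would play the role of the `for` loop.

```python
def run_gated_loop(task, generate, evaluate, refine, max_retries=3):
    """Generate once, then evaluate and refine until the scorecard passes
    or the retry cap is reached. Returns the last generation id and the
    list of scorecards, one per evaluation."""
    generation_id = generate(task)
    history = []
    for attempt in range(max_retries + 1):
        card = evaluate(generation_id)
        history.append(card)
        if card["passed"] or attempt == max_retries:
            break
        # Feed the scorecard's feedback back in as the refine prompt.
        generation_id = refine(generation_id, "\n".join(card["feedback"]))
    return generation_id, history


# --- Hypothetical in-memory stand-ins so the sketch runs on its own ---
drafts = {}


def fake_generate(task):
    drafts["gen-1"] = f"draft for: {task}"
    return "gen-1"


def fake_evaluate(generation_id):
    passed = "cites sources" in drafts[generation_id]
    return {
        "overall_score": 0.9 if passed else 0.4,
        "passed": passed,
        "feedback": [] if passed else ["Add citations."],
    }


def fake_refine(generation_id, feedback):
    # Assumption: a refinement yields a new generation id.
    new_id = f"gen-{len(drafts) + 1}"
    drafts[new_id] = drafts[generation_id] + " (revised: cites sources)"
    return new_id


final_id, history = run_gated_loop(
    "Summarize the Q3 report", fake_generate, fake_evaluate, fake_refine,
    max_retries=2,
)
print(final_id, [card["passed"] for card in history])
```

The retry cap matters as much as the pass check: without it, a rubric the agent cannot satisfy would loop forever. Keeping every scorecard in `history` also preserves the record of why earlier drafts failed.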

What this shows

Quality becomes a first-class, queryable artifact rather than a vibe. The same evaluator that gates the loop also keeps the scorecards, so you can ask “what is our pass rate this week?” without bolting on a separate analytics stack. Because the rubric lives on the evaluator and the task lives on a prompt, you can tune either independently — change the bar without touching the agent, or A/B two prompt variants against the same bar.
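
As a rough illustration of what a pass rate and a trend could mean, the sketch below computes both from two lists of stored scorecards. In the recipe these figures come from the evaluator's results and compare operations. The `pass_rate` and `trend` helpers here are hypothetical, and the 0.05 tolerance that separates "stable" from a real change is an assumption, not a documented value.

```python
def pass_rate(scorecards):
    """Fraction of scorecards whose passed flag is true (0.0 if empty)."""
    if not scorecards:
        return 0.0
    return sum(1 for card in scorecards if card["passed"]) / len(scorecards)


def trend(previous, current, tolerance=0.05):
    """Label the change in pass rate between two periods.

    Assumption: changes within +/- tolerance count as "stable".
    """
    delta = pass_rate(current) - pass_rate(previous)
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "stable"


last_week = [{"passed": p} for p in (False, False, True, True)]
this_week = [{"passed": p} for p in (True, True, True, False)]

print(pass_rate(this_week), trend(last_week, this_week))
```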
