Recipe: Evaluator-Gated Agent Loop
This recipe builds a self-correcting agent: a triformer generates an answer, an evaluator scores it against your criteria, and the loop refines the draft until it passes or hits a retry cap. It applies test-driven development (TDD) to an LLM: you write the rubric first, then let the agent iterate against it.
The problem it solves
A single agent call gives you whatever the model produces on the first try. You have no quality floor, no record of why an output was good or bad, and no automatic second attempt when it falls short. This recipe puts a graded checkpoint between “generate” and “done”, so a run only completes when it clears a bar you defined.
Elements
| Element | Role |
|---|---|
| `triformer` | The agent that drafts and revises the work. |
| `evaluator` | Scores each draft against weighted criteria and returns a pass/fail scorecard. |
| `prompt` | Holds the reusable task template (and any A/B variants). |
| `automation` | Drives the generate → evaluate → refine loop as a directed graph. |
Flow
1. Create a `triformer`. An `evaluator` is auto-provisioned when the agent is created; you do not attach one by hand, you configure its criteria.
2. Define the rubric on the evaluator. Each criterion carries a `name`, a `weight`, and a `threshold`; an overall `pass_threshold` decides the verdict. Read it back any time with the evaluator’s `criteria` operation.
3. Run the agent with `generate`, passing your task. You get back a `generation_id`.
4. Score the result with the evaluator’s `evaluate` operation: pass the agent’s `generation_id` (or inline `data.output`) and your `criteria`. It returns an `evaluation_id` and a `scorecard` with `overall_score`, a `passed` flag, and per-criterion feedback.
5. If `passed` is false, call the agent’s `refine` operation with the original `generation_id` and the scorecard’s feedback as the new prompt. `refine` continues the same conversation, so the agent keeps its prior tool-call history while it corrects.
6. Wrap steps 3–5 in an `automation`. Automations reference `triformer`, `evaluator`, and `prompt` as steps, and their built-in `condition` and `loop` children let you repeat `refine`/`evaluate` until the scorecard passes or you hit a retry cap.
7. Track quality over time with the evaluator’s `results` operation (returns a `pass_rate`) and `compare` (returns a `trend` of improving / stable / declining).
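The control flow in the steps above can be sketched locally in plain Python. This is a minimal illustration, not the platform's API: `Criterion`, `Scorecard`, `score_draft`, `run_gated_loop`, and the toy graders are hypothetical stand-ins, and the scoring rule (a weight-normalised average compared against `pass_threshold`, with per-criterion thresholds used only to produce feedback) is an assumption about how such a rubric might be combined.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    threshold: float  # assumed: per-criterion minimum on a 0..1 scale


@dataclass
class Scorecard:
    overall_score: float
    passed: bool
    feedback: dict[str, str]  # criterion name -> what fell short


def score_draft(output: str, criteria: list[Criterion],
                graders: dict[str, Callable[[str], float]],
                pass_threshold: float) -> Scorecard:
    """Hypothetical stand-in for the evaluator: weighted average of grader scores."""
    total_weight = sum(c.weight for c in criteria)
    weighted_sum = 0.0
    feedback: dict[str, str] = {}
    for c in criteria:
        s = graders[c.name](output)
        weighted_sum += c.weight * s
        if s < c.threshold:
            feedback[c.name] = f"scored {s:.2f}, below threshold {c.threshold:.2f}"
    overall = weighted_sum / total_weight
    # Assumption: the overall pass_threshold alone decides the verdict.
    return Scorecard(overall, overall >= pass_threshold, feedback)


def run_gated_loop(task: str,
                   generate: Callable[[str], str],
                   refine: Callable[[str, dict[str, str]], str],
                   evaluate: Callable[[str], Scorecard],
                   max_retries: int = 3) -> tuple[str, list[Scorecard]]:
    """Generate once, then refine/evaluate until the scorecard passes or retries run out."""
    draft = generate(task)
    card = evaluate(draft)
    history = [card]
    while not card.passed and len(history) <= max_retries:
        draft = refine(draft, card.feedback)
        card = evaluate(draft)
        history.append(card)
    return draft, history


# Toy stand-ins for the agent and the graders (purely illustrative).
CRITERIA = [Criterion("has_summary", weight=2.0, threshold=1.0),
            Criterion("concise", weight=1.0, threshold=1.0)]
GRADERS = {
    "has_summary": lambda text: 1.0 if "summary:" in text.lower() else 0.0,
    "concise": lambda text: 1.0 if len(text.split()) <= 12 else 0.0,
}


def toy_generate(task: str) -> str:
    return f"Draft for: {task}"


def toy_refine(draft: str, feedback: dict[str, str]) -> str:
    # A real agent would revise using the feedback text; this just patches the gap.
    return draft + " Summary: done." if "has_summary" in feedback else draft


final, history = run_gated_loop(
    "release notes", toy_generate, toy_refine,
    lambda d: score_draft(d, CRITERIA, GRADERS, pass_threshold=0.8),
)
print(final, [round(c.overall_score, 2) for c in history])
```

The retry cap matters as much as the pass check: without it, a rubric the agent cannot satisfy would loop indefinitely. Returning the full scorecard history keeps the record of why each draft failed.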
What this shows
Quality becomes a first-class, queryable artifact rather than a vibe. The same evaluator that gates the loop also keeps the scorecards, so you can ask “what is our pass rate this week?” without bolting on a separate analytics stack. Because the rubric lives on the evaluator and the task lives on a prompt, you can tune either independently — change the bar without touching the agent, or A/B two prompt variants against the same bar.
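To make the "pass rate" and "trend" ideas concrete, here is a small local sketch of how such figures could be derived from stored scores. The source does not say how the evaluator computes them, so the definitions below are assumptions: `pass_rate` is the fraction of passing runs, and `trend` compares the mean score of two windows with a hypothetical `tolerance`. The sample scores are made up for illustration.

```python
from statistics import mean


def pass_rate(passed_flags: list[bool]) -> float:
    """Fraction of runs that passed; 0.0 when there are no runs."""
    return sum(passed_flags) / len(passed_flags) if passed_flags else 0.0


def trend(previous: list[float], current: list[float], tolerance: float = 0.02) -> str:
    """Label the change in mean score between two non-empty windows (assumed definition)."""
    delta = mean(current) - mean(previous)
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "declining"
    return "stable"


# Illustrative data: two prompt variants scored against the same bar.
PASS_THRESHOLD = 0.75
variant_a = [0.62, 0.70, 0.81]
variant_b = [0.78, 0.85, 0.90]

rate_a = pass_rate([s >= PASS_THRESHOLD for s in variant_a])
rate_b = pass_rate([s >= PASS_THRESHOLD for s in variant_b])
week_over_week = trend(previous=[0.60, 0.65], current=[0.72, 0.80])
print(rate_a, rate_b, week_over_week)
```

Because both variants are judged against the same `PASS_THRESHOLD`, the two pass rates are directly comparable, which is the point of keeping the rubric separate from the prompt.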