Evaluator
The quality gate for automated pipelines — an evaluator scores another actor's output against weighted criteria and emits a standardized 0–100 scorecard, using custom code, an LLM-as-judge, or both, so a run can be measured the same way every time.
Working with it
Opening an Evaluator launches a code editor, its dedicated working surface.
How it appears
The same element type rendered as a definition, a circle instance, and a live workspace card.
When to use / not
When to use
- Scoring an agent's output after a task — attach it as a modifier and, once you turn its auto mode on (off by default, so attaching one never silently scores every call), the runtime runs it on each invocation and flags low-quality results.
- Grading along multiple weighted dimensions (accuracy, clarity, empathy, …) into one comparable score rather than a single yes/no check.
- Using an LLM-as-judge to rate subjective output an assertion can't express — turn on llm_assisted and point it at a model and prompt.
- Tracking whether quality is improving or regressing over time, and sampling only a fraction of high-throughput output to keep cost down in live stages.
When not to use
- Stopping a flow when output is bad — an evaluator records a score, it does not halt execution; branch on the result with a condition element instead.
- Enforcing a hard input schema or rejecting malformed data outright — that is the validation modifier's job; an evaluator measures quality, it doesn't bounce bad shapes.
- Stripping or masking forbidden content — reach for filter-words; an evaluator scores, it doesn't rewrite.
Topology
Attaches to another element as a modifier, shaping that element's behaviour rather than running on its own.
Properties
- criteria (array): Evaluation criteria. Each criterion is scored independently, then aggregated.
- scale (object): Evaluation scale. Defines the range for criterion scores; scores are normalized to 0-100 internally.
- mode (string): Evaluation mode. 'code' forces the inline/handler code path (deterministic, fast). 'llm' forces LLM-assisted evaluation (flexible, uses rubrics). 'delegate' forces scoring via the referenced actor in spec.element.ref (an agent or function). 'auto' selects delegate if spec.element.ref is set, else LLM if llm_assisted.enabled is true, otherwise falls back to code.
- element (object): Delegate scoring to a referenced actor element (an agent or function) instead of inline code or the LLM judge. The referenced element is invoked with the content to evaluate, and its output is normalized into the standard scorecard.
- auto (boolean): Auto-invoke this evaluator after every run of the element it is attached to. When true, the platform runs this evaluator against the target's output automatically (no explicit evaluate call) and emits an evaluator.auto_evaluate.completed event, the basis of the test-driven evaluator loop. Default off so attaching an evaluator never silently adds an evaluation to every production invocation. Honors sampling.rate.
- scoring_method (string): Scoring methodology
- thresholds (object): Score thresholds for pass/fail (0-100)
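The mode property's 'auto' fallback order can be sketched in a few lines of Python. This is an illustration of the rule as described above, not the platform's implementation; the function name resolve_mode and the plain-dict spec shape are assumptions.

```python
from typing import Any, Dict

VALID_MODES = {"auto", "code", "llm", "delegate"}


def resolve_mode(spec: Dict[str, Any]) -> str:
    """Return the effective evaluation mode for an evaluator spec (hypothetical helper)."""
    mode = spec.get("mode", "auto")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode != "auto":
        # Explicit modes force their own path regardless of other fields.
        return mode
    # 'auto': delegate wins if a ref is set, then LLM if enabled, else code.
    if (spec.get("element") or {}).get("ref"):
        return "delegate"
    if (spec.get("llm_assisted") or {}).get("enabled") is True:
        return "llm"
    return "code"


print(resolve_mode({"llm_assisted": {"enabled": True}}))
print(resolve_mode({"element": {"ref": "grader"}, "llm_assisted": {"enabled": True}}))
```

Under these assumptions the first call resolves to llm and the second to delegate, because a delegate ref takes precedence over the LLM judge.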
Capabilities
Defined for this element
- Compute
- Observe
Operations
- attach (POST)
- compare (GET)
- criteria (GET)
- delete (DELETE)
- detach (POST)
- disable (POST)
- enable (POST)
- evaluate (POST)
- get (GET)
- get_attached_modifiers (GET)
- intention (GET)
- list_attachments (GET)
- readme_update (POST)
- result_get (GET)
- results (GET)
- schema (GET)
- update (PATCH)
Ports
Inputs
- content (request)
- criteria (request)
Outputs
- result (request)
Composition
Errors / when it fails
- When llm_assisted.enabled is true, llm_assisted.model_ref must be provided
- When code.language is set, code.source must be provided
Validation rules
- Low sampling rate (<10%) — most outputs will not be evaluated
- LLM-assisted evaluation without criteria may produce inconsistent results
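The errors and warnings above can be checked client-side before a spec is submitted. The sketch below mirrors those four rules only; validate_spec is a hypothetical helper, and the platform's own validator may apply further checks.

```python
from typing import Any, Dict, List, Tuple


def validate_spec(spec: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for an evaluator spec, mirroring the listed rules."""
    errors: List[str] = []
    warnings: List[str] = []
    llm = spec.get("llm_assisted") or {}
    code = spec.get("code") or {}
    sampling = spec.get("sampling") or {}

    # Hard errors: the spec is rejected.
    if llm.get("enabled") and not llm.get("model_ref"):
        errors.append("llm_assisted.model_ref must be provided when llm_assisted.enabled is true")
    if code.get("language") and not code.get("source"):
        errors.append("code.source must be provided when code.language is set")

    # Soft warnings: the spec is accepted but probably not what you want.
    if sampling.get("rate", 1.0) < 0.10:
        warnings.append("low sampling rate (<10%): most outputs will not be evaluated")
    if llm.get("enabled") and not spec.get("criteria"):
        warnings.append("LLM-assisted evaluation without criteria may produce inconsistent results")
    return errors, warnings


errors, warnings = validate_spec({"llm_assisted": {"enabled": True}, "sampling": {"rate": 0.05}})
print(errors)
print(warnings)
```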
Evaluator (evaluator)
Category: modifiers | Form: modifier | Symbol: Ev
Validate and score actor outputs using custom code
Scores actor outputs against defined criteria, producing standardized scorecards (0-100 score, pass/fail). Three modes: (1) LLM-assisted — set spec.llm_assisted.enabled: true and optionally spec.llm_assisted.model_ref (a brain element; defaults to the platform evaluator model). (2) Runtime code — set spec.code.source + spec.code.language like a function, code must return {score, criteria:[]}. (3) Delegate — set spec.element.ref to an agent or function slug to score with that actor (mode: delegate, or auto when a ref is set). Define criteria in spec.criteria array with name, weight, rubric fields. Threshold defaults to 70 (spec.thresholds.pass). Scores stored as 0-100 range; values 0-1 are auto-scaled to 0-100. Can evaluate by generation_id (references an agent generation) or by passing data.response directly. Use compare operation to track score trends over time. Common mistake: not configuring LLM-assisted, runtime code, or a delegate ref — one is required.
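As a rough illustration of the score handling described above, the sketch below scales 0-1 values to 0-100 and applies the default pass threshold of 70. Two details are assumptions rather than documented behaviour: a raw value of exactly 1 is treated as being on the 0-1 scale, and a score equal to the threshold counts as a pass.

```python
from typing import Dict, Union


def normalize_score(raw: float) -> float:
    """Scale a 0-1 value to 0-100; leave values already on the 0-100 scale alone."""
    if raw < 0 or raw > 100:
        raise ValueError(f"score out of range: {raw}")
    # Assumption: anything up to and including 1 is on the 0-1 scale.
    return raw * 100.0 if raw <= 1.0 else float(raw)


def scorecard(raw: float, pass_threshold: float = 70.0) -> Dict[str, Union[float, bool]]:
    """Build a minimal scorecard: normalized score plus pass/fail."""
    score = normalize_score(raw)
    # Assumption: the threshold is a minimum, so score == threshold passes.
    return {"score": score, "passed": score >= pass_threshold}


print(scorecard(0.75))
print(scorecard(62))
```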
Guide
What It Does
Evaluator measures the quality of another actor’s output against a set of weighted criteria and produces a standardized scorecard. It can use custom code (JavaScript or Python), LLM-as-judge evaluation, or both in combination. When attached to an agent element with auto enabled (it is off by default), it runs after each invocation and flags low-quality outputs. It records a score and does not halt execution, so keeping a low-scoring output away from downstream elements requires branching on the result with a condition element. Evaluators are the quality gate in automated pipelines.
Element Definition
| Property | Value |
|---|---|
| Type | evaluator |
| Category | modifiers |
| Form | modifier |
| Symbol | fact_check / #10B981 |
Properties
| Field | Type | Default | Description |
|---|---|---|---|
| handler | string | - | Reference to the handler code/function for evaluation |
| criteria | array | - | Evaluation criteria (required). Each item: name, weight (0–1), evaluator (code ref) |
| scoring_method | string (enum) | numeric | binary, numeric, categorical, or weighted |
| thresholds.pass | number | 70 | Minimum score to pass (0–100) |
| thresholds.excellent | number | 90 | Score threshold for excellent rating (0–100) |
| code.language | string (enum) | javascript | Evaluation code language: javascript or python |
| code.source | string | - | Inline code or reference to a code element |
| code.timeout_ms | integer | 5000 | Maximum code execution time in milliseconds |
| input_schema | object | - | JSON Schema for the inputs being evaluated |
| output_schema.include_reasoning | boolean | true | Include per-criterion explanation in results |
| output_schema.include_suggestions | boolean | false | Include improvement suggestions in results |
| sampling.rate | number | 1.0 | Fraction of outputs to evaluate (0–1) |
| sampling.strategy | string (enum) | all | all, random, error_only, or periodic |
| llm_assisted.enabled | boolean | false | Enable LLM-as-judge evaluation |
| llm_assisted.model_ref | string | - | Reference to LLM supplier element |
| llm_assisted.prompt_ref | string | - | Reference to evaluation prompt element |
| alerts.on_failure | boolean | true | Send alert when evaluation fails |
| alerts.threshold | number | - | Score below which to send an alert (0–100) |
| storage.enabled | boolean | true | Store evaluation results |
| storage.retention_days | integer | 30 | Days to retain evaluation records |
Ports
| Direction | Port | Type | Required | Description |
|---|---|---|---|---|
| Input | content | request | Yes | Content to evaluate with optional reference/ground truth |
| Input | criteria | request | Yes | Evaluation criteria with weights |
| Output | result | request | Yes | Evaluation result: overall score, per-criterion breakdown |
Topology
- Lives in: modifiers/governance/evaluator/repository
- Referenced by: projects
- Accepts modifiers: rate-limit, auth-policy, requirements
- Uses resources: prompt, variable, sql, document, vector
Capabilities
| Capability | Description |
|---|---|
| llm-judge | LLM-as-judge evaluation |
| weighted-scoring | Weighted multi-criteria scoring |
| consistency | Consistent scoring across runs |
Error Codes
| Code | Class | Retryable | Description |
|---|---|---|---|
| EVALUATOR_CRITERIA_INVALID | validation | No | Invalid evaluation criteria |
| EVALUATOR_CONTENT_EMPTY | validation | No | No content to evaluate |
Quick Start
Creating via API
Create an evaluator inside a project:
POST /api/{circle}/{project}/
Content-Type: application/json
{
  "element_type": "evaluator",
  "slug": "response-quality",
  "name": "Response Quality Evaluator",
  "spec": {
    "scoring_method": "weighted",
    "thresholds": {
      "pass": 75,
      "excellent": 90
    },
    "criteria": [
      {"name": "accuracy", "weight": 0.5, "evaluator": "check_accuracy"},
      {"name": "clarity", "weight": 0.3, "evaluator": "check_clarity"},
      {"name": "completeness", "weight": 0.2, "evaluator": "check_completeness"}
    ]
  }
}
Basic Usage
Invoke the evaluator directly with the content to score and an optional reference. Criteria not passed in the request fall back to spec.criteria:
POST /api/{circle}/{project}/response-quality/ops/invoke
Content-Type: application/json
{
  "content": "The capital of France is Paris, located in the north of the country.",
  "reference": "Paris is the capital of France."
}
Project Patterns
How Evaluator Fits Into Projects
Evaluators serve two roles in a project. The first is as an attached modifier on an agent: with auto set to true, the evaluator runs after every invocation (subject to sampling.rate) and flags low-quality outputs. The second is as a standalone element wired into a flow: the flow explicitly routes output through the evaluator and branches on the pass/fail result using a condition element.
In the development stage, evaluators run on every output. In the demo and live stages, the sampling.rate property lets you evaluate a representative fraction rather than every invocation.
Example Project Spec
# LLM-judge evaluator for customer support responses
elements:
  - element_type: evaluator
    slug: support-evaluator
    spec:
      scoring_method: weighted
      llm_assisted:
        enabled: true
        model_ref: "claude-sonnet-4-6"
      criteria:
        - name: empathy
          weight: 0.3
          evaluator: rate_empathy
        - name: accuracy
          weight: 0.5
          evaluator: rate_accuracy
        - name: brevity
          weight: 0.2
          evaluator: rate_brevity
      thresholds:
        pass: 70
        excellent: 88
      output_schema:
        include_reasoning: true
        include_suggestions: true
Common Patterns
Auto-Evaluator Attached to an Agent
Attach an evaluator to an agent and turn on auto so every task is scored automatically:
# Create the evaluator (auto is off by default, so enable it explicitly)
POST /api/{circle}/{project}/
{
  "element_type": "evaluator",
  "slug": "task-quality-gate",
  "spec": {
    "auto": true,
    "scoring_method": "binary",
    "criteria": [
      {"name": "task_completed", "weight": 1.0}
    ],
    "thresholds": {"pass": 80}
  }
}
# Attach it to the agent
POST /api/{circle}/{project}/task-quality-gate/ops/attach
{ "target_id": "<agent-uuid>" }
With auto enabled, the evaluator runs when the agent completes a task and the result is recorded against the task.
Sampling for Production Pipelines
For high-throughput flows in live stage, evaluate only a random 10% sample to reduce cost:
POST /api/{circle}/{project}/
{
  "element_type": "evaluator",
  "slug": "sampled-evaluator",
  "spec": {
    "scoring_method": "numeric",
    "criteria": [{"name": "quality", "weight": 1.0}],
    "sampling": {
      "rate": 0.1,
      "strategy": "random"
    },
    "storage": {
      "enabled": true,
      "retention_days": 90
    }
  }
}
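To make the effect of sampling.rate and sampling.strategy concrete, here is a hedged Python sketch of a sampling decision. Only the strategy names and the 0-1 rate come from the property table. The semantics shown for each strategy are assumptions for illustration: "all" ignores the rate, "error_only" evaluates only failed invocations, and "periodic" evaluates every round(1/rate)-th invocation. The function should_evaluate and its parameters are hypothetical.

```python
import random
from typing import Optional


def should_evaluate(
    rate: float,
    strategy: str = "all",
    *,
    had_error: bool = False,
    invocation_index: int = 0,
    rng: Optional[random.Random] = None,
) -> bool:
    """Decide whether one invocation gets evaluated (illustrative semantics only)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if strategy == "all":
        return True  # assumption: 'all' evaluates everything
    if strategy == "error_only":
        return had_error  # assumption: only failed invocations are scored
    if strategy == "random":
        # Each invocation is picked independently with probability `rate`.
        return (rng or random).random() < rate
    if strategy == "periodic":
        if rate == 0:
            return False
        period = max(1, round(1 / rate))  # assumption: every Nth invocation
        return invocation_index % period == 0
    raise ValueError(f"unknown strategy: {strategy!r}")


rng = random.Random(0)
sampled = sum(should_evaluate(0.1, "random", rng=rng) for _ in range(10_000))
print(f"evaluated {sampled} of 10000 invocations")
```

With a rate of 0.1 and the random strategy, roughly one invocation in ten is evaluated, which is the cost reduction the spec above is aiming for.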
Attaching Resources
Evaluators can use data resources to look up reference material or ground truth during scoring:
POST /api/{circle}/{project}/{resource-slug}/ops/attach
{ "target_id": "<evaluator-uuid>" }
| Resource | Use case |
|---|---|
| vector | Semantic similarity scoring against a reference corpus |
| document | Load rubrics or scoring guidelines |
| sql | Look up expected values from a database |
| prompt | Inject evaluation prompt template for LLM-as-judge mode |
Applying Modifiers
| Modifier | Use case |
|---|---|
| rate-limit | Prevent evaluation floods when sampling.strategy is all |
| auth-policy | Restrict who can view evaluation results |
| requirements | Enforce that required data resources are available before evaluating |
Common Mistakes
Missing required criteria and scoring_method.
Both are required fields. An evaluator with no criteria has nothing to score against and will fail validation before it runs.
Weights that do not sum to 1.0 in weighted mode.
In weighted scoring, weights should sum to 1.0 across all criteria. Values like 0.5, 0.3, 0.2 sum correctly. Weights that sum above 1.0 inflate scores; weights that sum below 1.0 suppress them.
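The arithmetic behind this mistake is easy to see in a small sketch. It assumes the aggregate is a plain weighted sum of per-criterion scores, which is what the inflate/suppress behaviour described above implies; the function names are hypothetical, and normalized_weighted_score is a client-side guard rather than something the platform is documented to do.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Plain weighted sum of per-criterion scores (0-100 each)."""
    return sum(scores[name] * weight for name, weight in weights.items())


def normalized_weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum divided by the total weight, so the weights need not sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return weighted_score(scores, weights) / total


scores = {"accuracy": 80.0, "clarity": 80.0, "completeness": 80.0}

print(weighted_score(scores, {"accuracy": 0.5, "clarity": 0.3, "completeness": 0.2}))  # about 80
print(weighted_score(scores, {"accuracy": 0.5, "clarity": 0.5, "completeness": 0.2}))  # about 96, inflated
print(weighted_score(scores, {"accuracy": 0.5, "clarity": 0.2, "completeness": 0.1}))  # about 64, suppressed
```

Every criterion scores 80 in all three cases, yet only the weights that sum to 1.0 report 80 overall.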
Expecting the evaluator to halt a flow on failure.
When used as an attached modifier, the evaluator records a score but does not automatically stop execution. To gate on scores, wire the evaluator into a flow explicitly and use a condition element to branch on the result.score value.
Confusing the evaluator form (modifier) with the evaluator element type.
Evaluator is both an element type and a valid entry in attaches: for agent. When you see evaluator in an attaches: list, it means the evaluator element is attached as a modifier. With auto enabled, the runtime calls it after the parent actor completes.
Relationships
- Attaches to: circle, rate-limit, auth-policy, app, automation, document, graph, sql, timeseries, vector, files, python, javascript, ruby, rust-fn, go-fn, csharp
- Uses: prompt, variable, sql, document, vector
Capabilities
- llm-judge: LLM-as-judge evaluation
- weighted-scoring: Weighted multi-criteria scoring
- consistency: Consistent scoring across runs
Properties
| Property | Type | Default | Description |
|---|---|---|---|
| handler | string | - | Reference to the handler code/function for evaluation |
| criteria | array | - | Evaluation criteria; each criterion is scored independently, then aggregated |
| scale | object | - | Evaluation scale; defines the range for criterion scores. Scores are normalized to 0-100 internally. |
| mode | string | "auto" | Evaluation mode. ‘code’ forces the inline/handler code path (deterministic, fast). ‘llm’ forces LLM-assisted evaluation (flexible, uses rubrics). ‘delegate’ forces scoring via the referenced actor in spec.element.ref (an agent or function). ‘auto’ selects delegate if spec.element.ref is set, else LLM if llm_assisted.enabled is true, otherwise falls back to code. |
| element | object | - | Delegate scoring to a referenced actor element (an agent or function) instead of inline code or the LLM judge. The referenced element is invoked with the content to evaluate, and its output is normalized into the standard scorecard. |
| auto | boolean | false | Auto-invoke this evaluator after every run of the element it is attached to. When true, the platform runs this evaluator against the target’s output automatically (no explicit evaluate call) and emits an evaluator.auto_evaluate.completed event, the basis of the test-driven evaluator loop. Default off so attaching an evaluator never silently adds an evaluation to every production invocation. Honors sampling.rate. |
| scoring_method | string | "numeric" | Scoring methodology |
| thresholds | object | - | Score thresholds for pass/fail (0-100) |
| code | object | - | Custom evaluation code, inline or by reference |
| input_schema | object | - | Schema for inputs to evaluate |
| output_schema | object | - | Expected evaluation result schema |
| sampling | object | - | Controls which outputs get evaluated; useful for high-volume actors |
| llm_assisted | object | - | AI-assisted evaluation; uses an LLM to judge quality |
| alerts | object | - | Alert configuration for failed evaluations |
| storage | object | - | Evaluation result storage; stores scores and reasoning for analysis |
Operations
attach
POST /ops/attach | Auth: Read
Attach this modifier to a target element
Attaches this modifier to a target element. The target_id must be a UUID of an existing element that supports this modifier type (check applies_to in definition.yaml). Priority controls evaluation order when multiple modifiers of the same type are attached — lower priority runs first. The attachment is stored in element_modifiers table. Cascade resolution runs at bond-time to merge this modifier into the target’s resolved config. Common mistake: attaching to an incompatible element type — check topology rules first.
compare
GET /ops/compare | Auth: Read
Compare evaluation results over time
Compare score trends between two time windows. Set period to day, week (default), or month. Returns current_avg, previous_avg, change percentage, and trend (improving/stable/declining). Useful for tracking quality over time.
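The shape of the compare result can be illustrated with a short Python sketch over two lists of scores. The field names match the description above. The ±2% band used to call a trend "stable" is an assumption chosen for the example, as is the helper name compare_windows, and the platform's actual cut-off is not documented here.

```python
from statistics import mean
from typing import Dict, Sequence, Union


def compare_windows(
    current: Sequence[float],
    previous: Sequence[float],
    stable_band_pct: float = 2.0,  # assumption: changes within +/-2% count as stable
) -> Dict[str, Union[float, str]]:
    """Summarize the score trend between two time windows."""
    if not current or not previous:
        raise ValueError("both windows need at least one score")
    current_avg = mean(current)
    previous_avg = mean(previous)
    if previous_avg == 0:
        change = 0.0 if current_avg == 0 else float("inf")
    else:
        change = (current_avg - previous_avg) / previous_avg * 100.0
    if change > stable_band_pct:
        trend = "improving"
    elif change < -stable_band_pct:
        trend = "declining"
    else:
        trend = "stable"
    return {
        "current_avg": current_avg,
        "previous_avg": previous_avg,
        "change": change,
        "trend": trend,
    }


print(compare_windows(current=[80, 90], previous=[70, 80]))
```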
criteria
GET /ops/criteria | Auth: Read
Get configured evaluation criteria
Get the evaluator’s configured criteria definitions from spec. Returns criteria array with name, description, weight, threshold, and type for each criterion. Also returns the overall pass_threshold.
delete
DELETE /ops/delete | Auth: Admin
Delete element (soft delete)
Soft delete — sets state to ‘deleted’ but retains the record. Cannot delete elements that have children (has_no_bond precondition) or active runs. Requires admin auth and confirmation.
detach
POST /ops/detach | Auth: Read
Detach this modifier from a target element
Removes this modifier from a target element. Requires the target_id. Pervasive modifiers (audit, policy) can only be detached at the level they were originally attached — inherited pervasive modifiers cannot be detached by child elements. After detach, cascade resolution re-runs to remove this modifier’s effect from the resolved config.
disable
POST /ops/disable | Auth: Admin
Disable element (hides and prevents use)
Idempotent — safe to call on already-disabled elements. Optionally pass a reason string. Disabled elements cannot be invoked or executed. Inverse of enable.
enable
POST /ops/enable | Auth: Admin
Enable element (makes usable and visible)
Idempotent — safe to call on already-enabled elements. Transitions element to ready/enabled state. Cannot enable deleted elements. Inverse of disable.
evaluate
POST /ops/evaluate | Auth: Execute
Run evaluation against input data
Score content against criteria. Pass data.response or data.output for inline content, or data.generation_id to evaluate an agent generation. Criteria come from input.criteria (array or dict) or fall back to spec.criteria. Array: [{name, weight, threshold}]. Dict: {name: weight} or {name: {weight, threshold, rubric}}. Returns evaluation_id and scorecard with 0-100 scores. For inline code (spec.code.source + spec.code.language), function must be named handler(), main(), or entrypoint(). Return format: {score: N} or {overall_score: N, criteria: {name: score}} where N is 0.0-1.0.
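A minimal inline-code evaluator matching the contract above might look like the following. The function is named handler and returns {overall_score, criteria} with values in the 0.0-1.0 range, as the description requires. The argument shape is an assumption: this sketch takes a single dict with response and optional reference keys, which may not match what the runtime actually passes. The two criteria are toy checks chosen for illustration.

```python
from typing import Any, Dict


def handler(data: Dict[str, Any]) -> Dict[str, Any]:
    """Toy evaluator: scores non-emptiness and word overlap with a reference."""
    response = (data.get("response") or "").strip()
    reference = (data.get("reference") or "").strip()

    non_empty = 1.0 if response else 0.0

    # Fraction of reference words that also appear in the response.
    reference_words = set(reference.lower().split())
    response_words = set(response.lower().split())
    if reference_words:
        overlap = len(reference_words & response_words) / len(reference_words)
    else:
        overlap = non_empty  # no reference given: fall back to the non-empty check

    criteria = {"non_empty": non_empty, "reference_overlap": overlap}
    overall = sum(criteria.values()) / len(criteria)
    # All values are on the 0.0-1.0 scale; the platform scales them to 0-100.
    return {"overall_score": overall, "criteria": criteria}


print(handler({
    "response": "Paris is the capital of France",
    "reference": "Paris is the capital of France",
}))
```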
get
GET /ops/get | Auth: Read
Get element details
Element is already resolved by the routing layer — this returns the cached element, not a fresh DB query. Use the path /api/{circle}/{slug} to address elements.
get_attached_modifiers
GET /ops/attached/{target_id} | Auth: Read
Get all modifiers attached to a target element
Lists all modifiers attached to a specific target element, including modifier_id, type, subcategory, and priority. Useful for debugging cascade resolution or understanding which policies apply to an element before invoking it.
intention
GET /ops/intention | Auth: Read
Get element intention with full inheritance chain
Returns three levels: direct (this element’s intention), inherited (from category and root), and resolved (final merged intention). Useful for understanding an element’s purpose in context of its hierarchy.
list_attachments
GET /ops/targets | Auth: Read
List all elements this modifier is attached to
Returns all target elements where this modifier is currently applied. Shows target_id, target_type, priority, and cascade_policy.
readme_update
POST /ops/readme_update | Auth: Write
Update element README.md content
Creates or overwrites README.md in the element’s git repo. Commits to the draft branch. Content must be provided as a markdown string.
result_get
GET /ops/results/{evaluation_id} | Auth: Read
Get a specific evaluation result
Get full scorecard details including per-criterion scores, input data, and timestamps. Requires evaluation_id. Only returns results belonging to this evaluator element.
results
GET /ops/results | Auth: Read
List evaluation results (scorecards)
List historical evaluation scorecards. Filter by ?passed=true|false, ?since=, ?subject_element_id, ?subject_generation_id. Returns pass_rate across matching results. Paginate with ?limit (max 200) and ?offset.
schema
GET /ops/schema | Auth: Read
Get element input/output schema (MCP tools/list compatible)
Returns type-level port schemas from the TypeRegistry — not instance-specific overrides. Includes direction (input/output), required flag, and JSON schema per port. Useful for understanding what data an element accepts and produces.
update
PATCH /ops/update | Auth: Write
Update element
Partial update: send only the fields you want to change.
spec, name, and intention are all independently optional. spec MUST be a JSON object when present; it is deep-merged into the existing spec by default. An empty {"spec":{}} preserves existing spec content but still records a new version (no-op for content, not for version state). To clear or replace the entire spec wholesale, send {"spec":{...},"deep":false}. List-typed spec fields use replace semantics (the patch list replaces the existing list, no array merging). Coordinates Git + DB writes. Slug cannot be changed after creation.
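The merge semantics described for update can be sketched as follows: nested objects merge key by key, lists are replaced wholesale, and deep=false swaps in the patch as the whole spec. This is an illustration of those stated rules only. apply_spec_patch is a hypothetical helper, and it does not model versioning or the Git and DB writes.

```python
import copy
from typing import Any, Dict


def apply_spec_patch(existing: Dict[str, Any], patch: Dict[str, Any], deep: bool = True) -> Dict[str, Any]:
    """Return the spec that results from patching `existing` with `patch`."""
    if not isinstance(patch, dict):
        raise TypeError("spec must be a JSON object")
    if not deep:
        # deep=false: the patch replaces the entire spec.
        return copy.deepcopy(patch)
    merged = copy.deepcopy(existing)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged recursively.
            merged[key] = apply_spec_patch(merged[key], value)
        else:
            # Scalars and lists use replace semantics (no array merging).
            merged[key] = copy.deepcopy(value)
    return merged


existing = {
    "thresholds": {"pass": 70, "excellent": 90},
    "criteria": [{"name": "accuracy", "weight": 1.0}],
}
patch = {"thresholds": {"pass": 75}, "criteria": [{"name": "clarity", "weight": 1.0}]}
print(apply_spec_patch(existing, patch))
```

In this example thresholds.excellent survives the patch because objects merge, while the original accuracy criterion is dropped because the criteria list is replaced rather than appended to.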
Error Codes
| Code | Class | Retryable | Description |
|---|---|---|---|
| EVALUATOR_CRITERIA_INVALID | validation | no | Invalid evaluation criteria |
| EVALUATOR_CONTENT_EMPTY | validation | no | No content to evaluate |
Lifecycle / runtime
Defined for this element
Before invoke
- validate_input
- check_rate_limit
After invoke
- record_metrics
- emit_traces
On error
- log_error
- record_error_metric
Observability
Defined for this element
Metrics
- evaluation_count
- duration_ms
- error_rate
Events
- evaluator.completed
- evaluator.failed
Pricing / cost
Platform default
Operation costs
- create: free
- update: free
- delete: free
- get: free
- list: free
- invoke: 10000 micro-AU
- tool_use: free
Set it up
- Criteria (string): What to evaluate
- Scoring Method (string)