The quality gate for automated pipelines — an evaluator scores another actor's output against weighted criteria and emits a standardized 0–100 scorecard, using custom code, an LLM-as-judge, or both, so a run can be measured the same way every time.

Working with it

Opening an Evaluator launches a code editor, its dedicated working surface.

How it appears

The same element type rendered as a definition, a circle instance, and a live workspace card.

type

Evaluator

Validate and score actor outputs using custom code

modifiers / modifier / definition

When to use / not

When to use

Scoring an agent's output after a task — attach it as a modifier and, once you turn its auto mode on (off by default, so attaching one never silently scores every call), the runtime runs it on each invocation and flags low-quality results.
Grading along multiple weighted dimensions (accuracy, clarity, empathy, …) into one comparable score rather than a single yes/no check.
Using an LLM-as-judge to rate subjective output an assertion can't express — turn on llm_assisted and point it at a model and prompt.
Tracking whether quality is improving or regressing over time, and sampling only a fraction of high-throughput output to keep cost down in live stages.

When not to use

Stopping a flow when output is bad — an evaluator records a score, it does not halt execution; branch on the result with a condition element instead.
Enforcing a hard input schema or rejecting malformed data outright — that is the validation modifier's job; an evaluator measures quality, it doesn't bounce bad shapes.
Stripping or masking forbidden content — reach for filter-words; an evaluator scores, it doesn't rewrite.

Topology

Attaches to another element as a modifier, shaping that element's behaviour rather than running on its own.

Properties

criteria (array): Evaluation criteria. Each criterion is scored independently, then aggregated.
scale (object): Evaluation scale. Defines the range for criterion scores; scores are normalized to 0-100 internally.
mode (string): Evaluation mode. 'code' forces the inline/handler code path (deterministic, fast). 'llm' forces LLM-assisted evaluation (flexible, uses rubrics). 'delegate' forces scoring via the referenced actor in spec.element.ref (an agent or function). 'auto' selects delegate if spec.element.ref is set, else LLM if llm_assisted.enabled is true, otherwise falls back to code.
element (object): Delegate scoring to a referenced actor element (an agent or function) instead of inline code or the LLM judge. The referenced element is invoked with the content to evaluate, and its output is normalized into the standard scorecard.
auto (boolean): Auto-invoke this evaluator after every run of the element it is attached to. When true, the platform runs this evaluator against the target's output automatically (no explicit evaluate call) and emits an evaluator.auto_evaluate.completed event, the basis of the test-driven evaluator loop. Defaults to off so attaching an evaluator never silently adds an evaluation to every production invocation. Honors sampling.rate.
scoring_method (string): Scoring methodology
thresholds (object): Score thresholds for pass/fail (0-100)
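
The resolution order described for mode can be sketched in a few lines of Python. This is an illustration of the documented precedence only, not the platform's implementation; the function name resolve_mode and the grader-agent slug are hypothetical, and the spec is assumed to be a plain dictionary.

```python
from typing import Any, Dict


def resolve_mode(spec: Dict[str, Any]) -> str:
    """Return the effective evaluation path: 'delegate', 'llm', or 'code'."""
    mode = spec.get("mode", "auto")  # documented default is "auto"
    if mode != "auto":
        # An explicit mode forces that path regardless of other fields.
        return mode
    if (spec.get("element") or {}).get("ref"):
        return "delegate"  # a delegate ref wins first
    if (spec.get("llm_assisted") or {}).get("enabled") is True:
        return "llm"  # then the LLM judge, if enabled
    return "code"  # otherwise fall back to code


if __name__ == "__main__":
    print(resolve_mode({"element": {"ref": "grader-agent"},
                        "llm_assisted": {"enabled": True}}))
    print(resolve_mode({"llm_assisted": {"enabled": True}}))
    print(resolve_mode({}))
```

Under auto, a delegate ref takes precedence even when llm_assisted.enabled is also true, so set mode explicitly if you want the LLM judge in that case.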

Capabilities

Defined for this element

Compute
Observe

Operations

attach: POST
compare: GET
criteria: GET
delete: DELETE
detach: POST
disable: POST
enable: POST
evaluate: POST
get: GET
get_attached_modifiers: GET
intention: GET
list_attachments: GET
readme_update: POST
result_get: GET
results: GET
schema: GET
update: PATCH

Ports

Inputs

content: request
criteria: request

Outputs

result: request

Composition

Uses

Attaches

Referenced by

Errors / when it fails

When llm_assisted.enabled is true, llm_assisted.model_ref must be provided
When code.language is set, code.source must be provided

Validation rules

Low sampling rate (<10%) — most outputs will not be evaluated
LLM-assisted evaluation without criteria may produce inconsistent results

Evaluator (evaluator)

Category: modifiers | Form: modifier | Symbol: Ev

Validate and score actor outputs using custom code

Scores actor outputs against defined criteria, producing standardized scorecards (0-100 score, pass/fail). Three modes: (1) LLM-assisted — set spec.llm_assisted.enabled: true and optionally spec.llm_assisted.model_ref (a brain element; defaults to the platform evaluator model). (2) Runtime code — set spec.code.source + spec.code.language like a function, code must return {score, criteria:[]}. (3) Delegate — set spec.element.ref to an agent or function slug to score with that actor (mode: delegate, or auto when a ref is set). Define criteria in spec.criteria array with name, weight, rubric fields. Threshold defaults to 70 (spec.thresholds.pass). Scores stored as 0-100 range; values 0-1 are auto-scaled to 0-100. Can evaluate by generation_id (references an agent generation) or by passing data.response directly. Use compare operation to track score trends over time. Common mistake: not configuring LLM-assisted, runtime code, or a delegate ref — one is required.
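
As a rough illustration of the score range and thresholds described above, the following Python sketch scales 0-1 values to 0-100 and labels the result with the documented defaults (pass 70, excellent 90). The function names are hypothetical. Two details are assumptions: any value at or below 1 is treated as a 0-1 fraction, and a score equal to a threshold counts as meeting it.

```python
def normalize_score(raw: float) -> float:
    """Scale a 0-1 value to 0-100; leave values already on 0-100 unchanged."""
    if not 0 <= raw <= 100:
        raise ValueError(f"score out of range: {raw}")
    # Assumption: anything at or below 1 is a 0-1 fraction.
    return raw * 100 if raw <= 1 else float(raw)


def rate(score: float, pass_at: float = 70, excellent_at: float = 90) -> str:
    """Map a 0-100 score to a label using the default thresholds."""
    if score >= excellent_at:
        return "excellent"
    return "pass" if score >= pass_at else "fail"


if __name__ == "__main__":
    for raw in (0.75, 75, 0.5, 95):
        score = normalize_score(raw)
        print(raw, score, rate(score))
```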

Guide

Validate and score actor outputs using custom code

What It Does

Evaluator measures the quality of another actor's output against a set of weighted criteria and produces a standardized scorecard. It can use custom code (JavaScript or Python), LLM-as-judge evaluation, or both, and it can also delegate scoring to a referenced agent or function. When attached to an agent with auto enabled (it is off by default), it runs after each invocation and flags low-quality outputs. Evaluators are the quality gate in automated pipelines, but an evaluator records a score rather than halting execution; to stop or reroute a flow, branch on the result with a condition element.

Element Definition

Property	Value
Type	`evaluator`
Category	`modifiers`
Form	`modifier`
Symbol	`fact_check` / `#10B981`

Properties

Field	Type	Default	Description
`handler`	string	—	Reference to the handler code/function for evaluation
`criteria`	array	—	Evaluation criteria (required). Each item: `name`, `weight` (0–1), `evaluator` (code ref)
`scoring_method`	string (enum)	`numeric`	`binary`, `numeric`, `categorical`, or `weighted`
`thresholds.pass`	number	`70`	Minimum score to pass (0–100)
`thresholds.excellent`	number	`90`	Score threshold for excellent rating (0–100)
`code.language`	string (enum)	`javascript`	Evaluation code language: `javascript` or `python`
`code.source`	string	—	Inline code or reference to a code element
`code.timeout_ms`	integer	`5000`	Maximum code execution time in milliseconds
`input_schema`	object	—	JSON Schema for the inputs being evaluated
`output_schema.include_reasoning`	boolean	`true`	Include per-criterion explanation in results
`output_schema.include_suggestions`	boolean	`false`	Include improvement suggestions in results
`sampling.rate`	number	`1.0`	Fraction of outputs to evaluate (0–1)
`sampling.strategy`	string (enum)	`all`	`all`, `random`, `error_only`, or `periodic`
`llm_assisted.enabled`	boolean	`false`	Enable LLM-as-judge evaluation
`llm_assisted.model_ref`	string	—	Reference to LLM supplier element
`llm_assisted.prompt_ref`	string	—	Reference to evaluation prompt element
`alerts.on_failure`	boolean	`true`	Send alert when evaluation fails
`alerts.threshold`	number	—	Score below which to send an alert (0–100)
`storage.enabled`	boolean	`true`	Store evaluation results
`storage.retention_days`	integer	`30`	Days to retain evaluation records

Ports

Direction	Port	Type	Required	Description
Input	`content`	request	Yes	Content to evaluate with optional reference/ground truth
Input	`criteria`	request	Yes	Evaluation criteria with weights
Output	`result`	request	Yes	Evaluation result: overall score, per-criterion breakdown

Topology

Lives in: modifiers/governance/evaluator/ repository
Referenced by: projects
Accepts modifiers: rate-limit, auth-policy, requirements
Uses resources: prompt, variable, sql, document, vector

Capabilities

Capability	Description
`llm-judge`	LLM-as-judge evaluation
`weighted-scoring`	Weighted multi-criteria scoring
`consistency`	Consistent scoring across runs

Error Codes

Code	Class	Retryable	Description
`EVALUATOR_CRITERIA_INVALID`	validation	No	Invalid evaluation criteria
`EVALUATOR_CONTENT_EMPTY`	validation	No	No content to evaluate

Quick Start

Creating via API

Create an evaluator inside a project:

POST /api/{circle}/{project}/
Content-Type: application/json

{
  "element_type": "evaluator",
  "slug": "response-quality",
  "name": "Response Quality Evaluator",
  "spec": {
    "scoring_method": "weighted",
    "thresholds": {
      "pass": 75,
      "excellent": 90
    },
    "criteria": [
      {"name": "accuracy", "weight": 0.5, "evaluator": "check_accuracy"},
      {"name": "clarity", "weight": 0.3, "evaluator": "check_clarity"},
      {"name": "completeness", "weight": 0.2, "evaluator": "check_completeness"}
    ]
  }
}

Basic Usage

Invoke the evaluator directly with the content to score and an optional reference:

POST /api/{circle}/{project}/response-quality/ops/invoke
Content-Type: application/json

{
  "content": "The capital of France is Paris, located in the north of the country.",
  "reference": "Paris is the capital of France."
}

Project Patterns

How Evaluator Fits Into Projects

Evaluators serve two roles in a project. The first is as an attached modifier on an agent: with auto enabled (it is off by default), the evaluator runs after each invocation and flags low-quality outputs. The second is as a standalone element wired into a flow: the flow explicitly routes output through the evaluator and branches on the pass/fail result using a condition element.

In the development stage, evaluators run on every output. In the demo and live stages, the sampling.rate property lets you evaluate a representative fraction rather than every invocation.
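
To make the sampling idea concrete, here is a minimal Python sketch of a per-invocation sampling decision. The source names the strategies (all, random, error_only, periodic) but does not define their exact semantics, so the behaviour below is an assumption throughout: random draws against the rate, error_only evaluates only failed calls, and periodic evaluates roughly every 1/rate calls. The function and parameter names are hypothetical.

```python
import random
from typing import Optional


def should_evaluate(rate: float, strategy: str = "all", *,
                    had_error: bool = False, call_index: int = 0,
                    rng: Optional[random.Random] = None) -> bool:
    """Decide whether one invocation gets evaluated (assumed semantics)."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("sampling.rate must be between 0 and 1")
    if strategy == "all":
        return True
    if strategy == "error_only":
        return had_error
    if strategy == "random":
        # Evaluate with probability equal to the rate.
        return (rng or random).random() < rate
    if strategy == "periodic":
        if rate == 0:
            return False
        # Assumption: every Nth call, where N is about 1 / rate.
        return call_index % max(1, round(1 / rate)) == 0
    raise ValueError(f"unknown sampling.strategy: {strategy}")


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    sampled = sum(should_evaluate(0.1, "random", rng=rng) for _ in range(10_000))
    print(f"evaluated {sampled} of 10000 calls")
```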

Example Project Spec

# LLM-judge evaluator for customer support responses
elements:
  - element_type: evaluator
    slug: support-evaluator
    spec:
      scoring_method: weighted
      llm_assisted:
        enabled: true
        model_ref: "claude-sonnet-4-6"
      criteria:
        - name: empathy
          weight: 0.3
          evaluator: rate_empathy
        - name: accuracy
          weight: 0.5
          evaluator: rate_accuracy
        - name: brevity
          weight: 0.2
          evaluator: rate_brevity
      thresholds:
        pass: 70
        excellent: 88
      output_schema:
        include_reasoning: true
        include_suggestions: true

Common Patterns

Auto-Evaluator Attached to an Agent

Attach an evaluator to an agent and turn on auto so every task is scored automatically:

# Create the evaluator
POST /api/{circle}/{project}/

{
  "element_type": "evaluator",
  "slug": "task-quality-gate",
  "spec": {
    "auto": true,
    "scoring_method": "binary",
    "criteria": [
      {"name": "task_completed", "weight": 1.0}
    ],
    "thresholds": {"pass": 80}
  }
}

# Attach it to the agent
POST /api/{circle}/{project}/task-quality-gate/ops/attach

{ "target_id": "<agent-uuid>" }

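With auto set to true, the evaluator runs after each task the agent completes (subject to sampling.rate), and the result is recorded against the task. With auto left at its default of false, attaching the evaluator alone does not score every invocation.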

Sampling for Production Pipelines

For high-throughput flows in live stage, evaluate only a random 10% sample to reduce cost:

POST /api/{circle}/{project}/

{
  "element_type": "evaluator",
  "slug": "sampled-evaluator",
  "spec": {
    "scoring_method": "numeric",
    "criteria": [{"name": "quality", "weight": 1.0}],
    "sampling": {
      "rate": 0.1,
      "strategy": "random"
    },
    "storage": {
      "enabled": true,
      "retention_days": 90
    }
  }
}

Attaching Resources

Evaluators can use data resources to look up reference material or ground truth during scoring:

POST /api/{circle}/{project}/{resource-slug}/ops/attach

{ "target_id": "<evaluator-uuid>" }

Resource	Use case
`vector`	Semantic similarity scoring against a reference corpus
`document`	Load rubrics or scoring guidelines
`sql`	Look up expected values from a database
`prompt`	Inject evaluation prompt template for LLM-as-judge mode

Applying Modifiers

Modifier	Use case
`rate-limit`	Prevent evaluation floods when `sampling.strategy` is `all`
`auth-policy`	Restrict who can view evaluation results
`requirements`	Enforce that required data resources are available before evaluating

Common Mistakes

Missing required criteria and scoring_method. Both are required fields. An evaluator with no criteria has nothing to score against and will fail validation before it runs.

Weights that do not sum to 1.0 in weighted mode. In weighted scoring, weights should sum to 1.0 across all criteria. Values like 0.5, 0.3, 0.2 sum correctly. Weights that sum above 1.0 inflate scores; weights that sum below 1.0 suppress them.
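
The effect of mis-summed weights is easy to see with a small Python sketch. It assumes the aggregate is a plain weighted sum of per-criterion scores on a 0-100 scale, which matches the inflation and suppression behaviour described above but is not confirmed as the platform's exact formula. The function names and sample scores are hypothetical.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Aggregate per-criterion scores (0-100) as a plain weighted sum."""
    return sum(scores[name] * weight for name, weight in weights.items())


def check_weights(weights: Dict[str, float], tolerance: float = 1e-9) -> None:
    """Raise if the weights do not sum to 1.0 within a small tolerance."""
    total = sum(weights.values())
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"weights sum to {total:.3f}, expected 1.0")


if __name__ == "__main__":
    scores = {"accuracy": 80, "clarity": 90, "completeness": 70}

    good = {"accuracy": 0.5, "clarity": 0.3, "completeness": 0.2}
    check_weights(good)
    print(round(weighted_score(scores, good), 2))  # stays within 0-100

    inflated = {"accuracy": 0.6, "clarity": 0.4, "completeness": 0.3}
    print(round(weighted_score(scores, inflated), 2))  # exceeds 100
```

Running a check like check_weights before creating or updating an evaluator catches the problem before any score is recorded.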

Expecting the evaluator to halt a flow on failure. When used as an attached modifier, the evaluator records a score but does not automatically stop execution. To gate on scores, wire the evaluator into a flow explicitly and use a condition element to branch on the result.score value.

Confusing the evaluator form (modifier) with the evaluator element type. Evaluator is both an element type and a valid entry in an agent's attaches: list. When evaluator appears in an attaches: list, it means this evaluator element is attached as a modifier; with auto enabled, the runtime calls it after the parent actor completes.

Relationships

Attaches to: circle, rate-limit, auth-policy, app, automation, document, graph, sql, timeseries, vector, files, python, javascript, ruby, rust-fn, go-fn, csharp
Uses: prompt, variable, sql, document, vector

Properties

Property	Type	Default	Description
`handler`	string	—	Reference to the handler code/function for evaluation
`criteria`	array	—	Evaluation criteria — each criterion is scored independently then aggregated
`scale`	object	—	Evaluation scale — defines the range for criterion scores. Scores are normalized to 0-100 internally.
`mode`	string	`"auto"`	Evaluation mode. ‘code’ forces the inline/handler code path (deterministic, fast). ‘llm’ forces LLM-assisted evaluation (flexible, uses rubrics). ‘delegate’ forces scoring via the referenced actor in spec.element.ref (an agent or function). ‘auto’ selects delegate if spec.element.ref is set, else LLM if llm_assisted.enabled is true, otherwise falls back to code.
`element`	object	—	Delegate scoring to a referenced actor element (an agent or function) instead of inline code or the LLM judge. The referenced element is invoked with the content to evaluate, and its output is normalized into the standard scorecard.
`auto`	boolean	`false`	Auto-invoke this evaluator after every run of the element it is attached to. When true, the platform runs this evaluator against the target’s output automatically (no explicit evaluate call) and emits an evaluator.auto_evaluate.completed event — the basis of the test-driven evaluator loop. Default off so attaching an evaluator never silently adds an evaluation to every production invocation. Honors sampling.rate.
`scoring_method`	string	`"numeric"`	Scoring methodology
`thresholds`	object	—	Score thresholds for pass/fail (0-100)
`code`	object	—	Custom evaluation code — inline or reference
`input_schema`	object	—	Schema for inputs to evaluate
`output_schema`	object	—	Expected evaluation result schema
`sampling`	object	—	Controls which outputs get evaluated — useful for high-volume actors
`llm_assisted`	object	—	AI-assisted evaluation — uses an LLM to judge quality
`alerts`	object	—	Alert configuration for failed evaluations
`storage`	object	—	Evaluation result storage — stores scores and reasoning for analysis

Operations

`attach`

POST /ops/attach | Auth: Read

Attach this modifier to a target element

Attaches this modifier to a target element. The target_id must be a UUID of an existing element that supports this modifier type (check applies_to in definition.yaml). Priority controls evaluation order when multiple modifiers of the same type are attached — lower priority runs first. The attachment is stored in element_modifiers table. Cascade resolution runs at bond-time to merge this modifier into the target’s resolved config. Common mistake: attaching to an incompatible element type — check topology rules first.

`compare`

GET /ops/compare | Auth: Read

Compare evaluation results over time

Compare score trends between two time windows. Set period to day, week (default), or month. Returns current_avg, previous_avg, change percentage, and trend (improving/stable/declining). Useful for tracking quality over time.

`criteria`

GET /ops/criteria | Auth: Read

Get configured evaluation criteria

Get the evaluator’s configured criteria definitions from spec. Returns criteria array with name, description, weight, threshold, and type for each criterion. Also returns the overall pass_threshold.

`delete`

DELETE /ops/delete | Auth: Admin

Delete element (soft delete)

Soft delete — sets state to ‘deleted’ but retains the record. Cannot delete elements that have children (has_no_bond precondition) or active runs. Requires admin auth and confirmation.

`detach`

POST /ops/detach | Auth: Read

Detach this modifier from a target element

Removes this modifier from a target element. Requires the target_id. Pervasive modifiers (audit, policy) can only be detached at the level they were originally attached — inherited pervasive modifiers cannot be detached by child elements. After detach, cascade resolution re-runs to remove this modifier’s effect from the resolved config.

`disable`

POST /ops/disable | Auth: Admin

Disable element (hides and prevents use)

Idempotent — safe to call on already-disabled elements. Optionally pass a reason string. Disabled elements cannot be invoked or executed. Inverse of enable.

`enable`

POST /ops/enable | Auth: Admin

Enable element (makes usable and visible)

Idempotent — safe to call on already-enabled elements. Transitions element to ready/enabled state. Cannot enable deleted elements. Inverse of disable.

`evaluate`

POST /ops/evaluate | Auth: Execute

Run evaluation against input data

Score content against criteria. Pass data.response or data.output for inline content, or data.generation_id to evaluate an agent generation. Criteria come from input.criteria (array or dict) or fall back to spec.criteria. Array: [{name, weight, threshold}]. Dict: {name: weight} or {name: {weight, threshold, rubric}}. Returns evaluation_id and scorecard with 0-100 scores. For inline code (spec.code.source + spec.code.language), function must be named handler(), main(), or entrypoint(). Return format: {score: N} or {overall_score: N, criteria: {name: score}} where N is 0.0-1.0.
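
A minimal Python sketch of inline evaluation code that follows the documented contract: the function is named handler and returns overall_score plus a criteria mapping, all on the 0.0-1.0 scale. The shape of the argument is an assumption (a dictionary with response and an optional reference key), as are the two toy criteria and their 0.7/0.3 weighting; the source does not specify what the handler receives.

```python
from typing import Any, Dict


def handler(data: Dict[str, Any]) -> Dict[str, Any]:
    """Score a response on two toy criteria, each in the range 0.0-1.0."""
    response = str(data.get("response", "")).strip()
    reference = str(data.get("reference", "")).strip()

    # Criterion 1: fraction of reference words that appear in the response.
    ref_words = set(reference.lower().split())
    resp_words = set(response.lower().split())
    coverage = len(ref_words & resp_words) / len(ref_words) if ref_words else 0.0

    # Criterion 2: brevity, full marks at or under 50 words.
    word_count = len(response.split())
    brevity = 1.0 if word_count <= 50 else 50 / word_count

    # Hypothetical weighting, chosen only for this example.
    overall = 0.7 * coverage + 0.3 * brevity
    return {
        "overall_score": overall,
        "criteria": {"coverage": coverage, "brevity": brevity},
    }


if __name__ == "__main__":
    print(handler({
        "response": "Paris is the capital of France",
        "reference": "Paris is the capital of France",
    }))
```

Word overlap is a crude stand-in for accuracy; real evaluation code would usually compare against structured ground truth or call a domain-specific check.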

`get`

GET /ops/get | Auth: Read

Get element details

Element is already resolved by the routing layer — this returns the cached element, not a fresh DB query. Use the path /api/{circle}/{slug} to address elements.

`get_attached_modifiers`

GET /ops/attached/{target_id} | Auth: Read

Get all modifiers attached to a target element

Lists all modifiers attached to a specific target element, including modifier_id, type, subcategory, and priority. Useful for debugging cascade resolution or understanding which policies apply to an element before invoking it.

`intention`

GET /ops/intention | Auth: Read

Get element intention with full inheritance chain

Returns three levels: direct (this element’s intention), inherited (from category and root), and resolved (final merged intention). Useful for understanding an element’s purpose in context of its hierarchy.

`list_attachments`

GET /ops/targets | Auth: Read

List all elements this modifier is attached to

Returns all target elements where this modifier is currently applied. Shows target_id, target_type, priority, and cascade_policy.

`readme_update`

POST /ops/readme_update | Auth: Write

Update element README.md content

Creates or overwrites README.md in the element’s git repo. Commits to the draft branch. Content must be provided as a markdown string.

`result_get`

GET /ops/results/{evaluation_id} | Auth: Read

Get a specific evaluation result

Get full scorecard details including per-criterion scores, input data, and timestamps. Requires evaluation_id. Only returns results belonging to this evaluator element.

`results`

GET /ops/results | Auth: Read

List evaluation results (scorecards)

List historical evaluation scorecards. Filter by ?passed=true|false, ?since=, ?subject_element_id, ?subject_generation_id. Returns pass_rate across matching results. Paginate with ?limit (max 200) and ?offset.
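
For paging through scorecards, a small Python helper can build the query string and enforce the documented limit cap of 200. This is a sketch only: the base URL, host, and helper name are hypothetical, and the since filter is left out because its value format is not specified here.

```python
from typing import Optional
from urllib.parse import urlencode


def results_url(base: str, *, passed: Optional[bool] = None,
                limit: int = 50, offset: int = 0) -> str:
    """Build a GET /ops/results URL with pagination and an optional filter."""
    if not 1 <= limit <= 200:
        raise ValueError("limit must be between 1 and 200")
    if offset < 0:
        raise ValueError("offset must not be negative")
    params = {"limit": limit, "offset": offset}
    if passed is not None:
        params["passed"] = "true" if passed else "false"
    return f"{base}/ops/results?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and element path, for illustration only.
    base = "https://example.invalid/api/my-circle/my-project/response-quality"
    for page in range(3):
        print(results_url(base, passed=False, limit=100, offset=page * 100))
```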

`schema`

GET /ops/schema | Auth: Read

Get element input/output schema (MCP tools/list compatible)

Returns type-level port schemas from the TypeRegistry — not instance-specific overrides. Includes direction (input/output), required flag, and JSON schema per port. Useful for understanding what data an element accepts and produces.

`update`

PATCH /ops/update | Auth: Write

Update element

Partial update — send only the fields you want to change. spec, name, and intention are all independently optional. spec MUST be a JSON object when present; deep-merged into the existing spec by default. Empty {"spec":{}} preserves existing spec content but still records a new version (no-op for content, not for version state). To clear/replace the entire spec wholesale send {"spec":{...},"deep":false}. List-typed spec fields use replace semantics (the patch list replaces the existing list, no array merging). Coordinates Git + DB writes. Slug cannot be changed after creation.
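
The merge semantics described above can be modelled in Python as follows. This is a local illustration of the stated rules (nested objects merge, lists and scalars replace, deep: false swaps the whole spec), not the server's code; the function names are hypothetical.

```python
from typing import Any, Dict


def deep_merge(existing: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Merge patch into existing: dicts merge recursively, everything else replaces."""
    merged = dict(existing)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists replace the existing value wholesale.
            merged[key] = value
    return merged


def apply_update(existing_spec: Dict[str, Any], patch_spec: Dict[str, Any],
                 deep: bool = True) -> Dict[str, Any]:
    """Model the update operation's effect on spec content."""
    return deep_merge(existing_spec, patch_spec) if deep else dict(patch_spec)


if __name__ == "__main__":
    spec = {
        "thresholds": {"pass": 70, "excellent": 90},
        "criteria": [{"name": "accuracy", "weight": 1.0}],
    }
    patch = {
        "thresholds": {"pass": 80},
        "criteria": [{"name": "clarity", "weight": 1.0}],
    }
    print(apply_update(spec, patch))              # thresholds merged, criteria replaced
    print(apply_update(spec, patch, deep=False))  # whole spec replaced by the patch
```

Because lists are replaced rather than merged, a patch that changes one criterion must send the full criteria array.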

Lifecycle / runtime

Defined for this element

Before invoke

validate_input
check_rate_limit

After invoke

record_metrics
emit_traces

On error

log_error
record_error_metric

Observability

Defined for this element

Metrics

evaluation_count
duration_ms
error_rate

Events

evaluator.completed
evaluator.failed

Pricing / cost

Platform default

Operation costs

create: free
update: free
delete: free
get: free
list: free
invoke: 10000 micro-AU
tool_use: free
