Evaluator

The quality gate for automated pipelines — an evaluator scores another actor's output against weighted criteria and emits a standardized 0–100 scorecard, using custom code, an LLM-as-judge, or both, so a run can be measured the same way every time.

Working with it

Opening an Evaluator launches a code editor, its dedicated working surface.

How it appears

The same element type rendered as a definition, a circle instance, and a live workspace card.

When to use / not

When to use

  • Scoring an agent's output after a task — attach it as a modifier and, once you turn its auto mode on (off by default, so attaching one never silently scores every call), the runtime runs it on each invocation and flags low-quality results.
  • Grading along multiple weighted dimensions (accuracy, clarity, empathy, …) into one comparable score rather than a single yes/no check.
  • Using an LLM-as-judge to rate subjective output an assertion can't express — turn on llm_assisted and point it at a model and prompt.
  • Tracking whether quality is improving or regressing over time, and sampling only a fraction of high-throughput output to keep cost down in live stages.

When not to use

  • Stopping a flow when output is bad — an evaluator records a score, it does not halt execution; branch on the result with a condition element instead.
  • Enforcing a hard input schema or rejecting malformed data outright — that is the validation modifier's job; an evaluator measures quality, it doesn't bounce bad shapes.
  • Stripping or masking forbidden content — reach for filter-words; an evaluator scores, it doesn't rewrite.

Topology

Attaches to another element as a modifier, shaping that element's behaviour rather than running on its own.

Properties

criteria (array)
Evaluation criteria: each criterion is scored independently, then aggregated.
scale (object)
Evaluation scale: defines the range for criterion scores. Scores are normalized to 0-100 internally.
mode (string)
Evaluation mode. 'code' forces the inline/handler code path (deterministic, fast). 'llm' forces LLM-assisted evaluation (flexible, uses rubrics). 'delegate' forces scoring via the referenced actor in spec.element.ref (an agent or function). 'auto' selects delegate if spec.element.ref is set, else LLM if llm_assisted.enabled is true, otherwise falls back to code.
element (object)
Delegate scoring to a referenced actor element (an agent or function) instead of inline code or the LLM judge. The referenced element is invoked with the content to evaluate, and its output is normalized into the standard scorecard.
auto (boolean)
Auto-invoke this evaluator after every run of the element it is attached to. When true, the platform runs this evaluator against the target's output automatically (no explicit evaluate call) and emits an evaluator.auto_evaluate.completed event, the basis of the test-driven evaluator loop. Default off so attaching an evaluator never silently adds an evaluation to every production invocation. Honors sampling.rate.
scoring_method (string)
Scoring methodology
thresholds (object)
Score thresholds for pass/fail (0-100)
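
The precedence described for mode 'auto' can be read as a short decision chain. The sketch below is illustrative only: resolve_mode is a hypothetical helper, not a platform API, and it assumes the spec is available as a plain dictionary shaped like the properties listed above, with mode defaulting to 'auto'.

```python
def resolve_mode(spec: dict) -> str:
    """Return the effective evaluation mode for an evaluator spec.

    Hypothetical helper that mirrors the documented 'auto' precedence:
    delegate (if element.ref is set), then llm, then code.
    """
    mode = spec.get("mode", "auto")
    if mode != "auto":
        # 'code', 'llm', and 'delegate' force a specific path.
        return mode
    if (spec.get("element") or {}).get("ref"):
        return "delegate"
    if (spec.get("llm_assisted") or {}).get("enabled"):
        return "llm"
    return "code"


# A delegate ref wins over llm_assisted when mode is left on 'auto'.
print(resolve_mode({"element": {"ref": "scoring-agent"},
                    "llm_assisted": {"enabled": True}}))
```

Here "scoring-agent" is a made-up slug used only for illustration.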

Capabilities

Defined for this element
  • Compute
  • Observe

Operations

  • attach (POST)
  • compare (GET)
  • criteria (GET)
  • delete (DELETE)
  • detach (POST)
  • disable (POST)
  • enable (POST)
  • evaluate (POST)
  • get (GET)
  • get_attached_modifiers (GET)
  • intention (GET)
  • list_attachments (GET)
  • readme_update (POST)
  • result_get (GET)
  • results (GET)
  • schema (GET)
  • update (PATCH)

Ports

Inputs

  • content (request)
  • criteria (request)

Outputs

  • result (request)

Composition

Errors / when it fails

When llm_assisted.enabled is true, llm_assisted.model_ref must be provided
When code.language is set, code.source must be provided

Validation rules

  • Low sampling rate (<10%) — most outputs will not be evaluated
  • LLM-assisted evaluation without criteria may produce inconsistent results
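
The two error conditions and the two validation rules above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: validate_spec is a hypothetical helper, the spec is assumed to be a plain dictionary, and treating the validation rules as warnings rather than hard failures is a guess, since the page does not say how they are enforced.

```python
def validate_spec(spec: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for an evaluator spec.

    Hypothetical helper: the messages mirror the documented ones, but
    the errors/warnings split is an assumption.
    """
    errors: list[str] = []
    warnings: list[str] = []
    llm = spec.get("llm_assisted") or {}
    code = spec.get("code") or {}
    sampling = spec.get("sampling") or {}

    if llm.get("enabled") and not llm.get("model_ref"):
        errors.append("llm_assisted.model_ref must be provided")
    if code.get("language") and not code.get("source"):
        errors.append("code.source must be provided")
    # sampling.rate defaults to 1.0 (evaluate everything).
    if sampling.get("rate", 1.0) < 0.10:
        warnings.append("low sampling rate: most outputs will not be evaluated")
    if llm.get("enabled") and not spec.get("criteria"):
        warnings.append("LLM-assisted evaluation without criteria")
    return errors, warnings


errors, warnings = validate_spec({
    "code": {"language": "python"},   # source is missing on purpose
    "sampling": {"rate": 0.05},
})
print(errors, warnings)
```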

Evaluator (evaluator)

Category: modifiers | Form: modifier | Symbol: Ev

Validate and score actor outputs using custom code

Scores actor outputs against defined criteria, producing standardized scorecards (0-100 score, pass/fail). Three modes:

  • LLM-assisted: set spec.llm_assisted.enabled: true and optionally spec.llm_assisted.model_ref (a brain element; defaults to the platform evaluator model).
  • Runtime code: set spec.code.source + spec.code.language like a function; the code must return {score, criteria:[]}.
  • Delegate: set spec.element.ref to an agent or function slug to score with that actor (mode: delegate, or auto when a ref is set).

Define criteria in the spec.criteria array with name, weight, and rubric fields. The pass threshold defaults to 70 (spec.thresholds.pass). Scores are stored in the 0-100 range; values 0-1 are auto-scaled to 0-100. An evaluation can target a generation_id (referencing an agent generation) or take data.response directly. Use the compare operation to track score trends over time.

Common mistake: not configuring LLM-assisted, runtime code, or a delegate ref. One of the three is required.
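
The auto-scaling and threshold behaviour described here can be sketched in a few lines. The helpers normalize_score and rate are hypothetical names, and two details are assumptions rather than documented behaviour: that a raw value of exactly 1 is treated as a fraction (so it becomes 100), and that thresholds are compared with "greater than or equal".

```python
def normalize_score(raw: float) -> float:
    """Map a raw score onto the 0-100 range.

    Assumption: anything in [0, 1] is a fraction and is scaled by 100;
    anything in (1, 100] is already on the 0-100 scale.
    """
    if not 0 <= raw <= 100:
        raise ValueError(f"score out of range: {raw}")
    return raw * 100 if raw <= 1 else float(raw)


def rate(score: float, pass_threshold: float = 70, excellent_threshold: float = 90) -> str:
    """Classify a 0-100 score using the documented default thresholds."""
    if score >= excellent_threshold:
        return "excellent"
    if score >= pass_threshold:
        return "pass"
    return "fail"


for raw in (0.75, 92, 0.5):
    score = normalize_score(raw)
    print(raw, "->", score, rate(score))
```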

Guide

What It Does

Evaluator measures the quality of another actor’s output against a set of weighted criteria and produces a standardized scorecard. It can use custom code (JavaScript or Python), LLM-as-judge evaluation, or both in combination. When attached to an agent element with auto enabled (it is off by default), it runs after each invocation and flags low-quality outputs. It records a score rather than halting execution, so keeping bad output away from downstream elements requires branching on the result with a condition element. Evaluators are the quality gate in automated pipelines.

Element Definition

| Property | Value |
|---|---|
| Type | evaluator |
| Category | modifiers |
| Form | modifier |
| Symbol | fact_check / #10B981 |

Properties

| Field | Type | Default | Description |
|---|---|---|---|
| handler | string | | Reference to the handler code/function for evaluation |
| criteria | array | | Evaluation criteria (required). Each item: name, weight (0–1), evaluator (code ref) |
| scoring_method | string (enum) | numeric | binary, numeric, categorical, or weighted |
| thresholds.pass | number | 70 | Minimum score to pass (0–100) |
| thresholds.excellent | number | 90 | Score threshold for excellent rating (0–100) |
| code.language | string (enum) | javascript | Evaluation code language: javascript or python |
| code.source | string | | Inline code or reference to a code element |
| code.timeout_ms | integer | 5000 | Maximum code execution time in milliseconds |
| input_schema | object | | JSON Schema for the inputs being evaluated |
| output_schema.include_reasoning | boolean | true | Include per-criterion explanation in results |
| output_schema.include_suggestions | boolean | false | Include improvement suggestions in results |
| sampling.rate | number | 1.0 | Fraction of outputs to evaluate (0–1) |
| sampling.strategy | string (enum) | all | all, random, error_only, or periodic |
| llm_assisted.enabled | boolean | false | Enable LLM-as-judge evaluation |
| llm_assisted.model_ref | string | | Reference to LLM supplier element |
| llm_assisted.prompt_ref | string | | Reference to evaluation prompt element |
| alerts.on_failure | boolean | true | Send alert when evaluation fails |
| alerts.threshold | number | | Score below which to send an alert (0–100) |
| storage.enabled | boolean | true | Store evaluation results |
| storage.retention_days | integer | 30 | Days to retain evaluation records |

Ports

| Direction | Port | Type | Required | Description |
|---|---|---|---|---|
| Input | content | request | Yes | Content to evaluate with optional reference/ground truth |
| Input | criteria | request | Yes | Evaluation criteria with weights |
| Output | result | request | Yes | Evaluation result: overall score, per-criterion breakdown |

Topology

  • Lives in: modifiers/governance/evaluator/ repository
  • Referenced by: projects
  • Accepts modifiers: rate-limit, auth-policy, requirements
  • Uses resources: prompt, variable, sql, document, vector

Capabilities

| Capability | Description |
|---|---|
| llm-judge | LLM-as-judge evaluation |
| weighted-scoring | Weighted multi-criteria scoring |
| consistency | Consistent scoring across runs |

Error Codes

| Code | Class | Retryable | Description |
|---|---|---|---|
| EVALUATOR_CRITERIA_INVALID | validation | No | Invalid evaluation criteria |
| EVALUATOR_CONTENT_EMPTY | validation | No | No content to evaluate |

Quick Start

Creating via API

Create an evaluator inside a project:

POST /api/{circle}/{project}/
Content-Type: application/json

{
  "element_type": "evaluator",
  "slug": "response-quality",
  "name": "Response Quality Evaluator",
  "spec": {
    "scoring_method": "weighted",
    "thresholds": {
      "pass": 75,
      "excellent": 90
    },
    "criteria": [
      {"name": "accuracy", "weight": 0.5, "evaluator": "check_accuracy"},
      {"name": "clarity", "weight": 0.3, "evaluator": "check_clarity"},
      {"name": "completeness", "weight": 0.2, "evaluator": "check_completeness"}
    ]
  }
}

Basic Usage

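Invoke the evaluator directly with the content to score and an optional reference (ground truth). This request passes no criteria of its own, so the criteria defined in the evaluator's spec apply: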

POST /api/{circle}/{project}/response-quality/ops/invoke
Content-Type: application/json

{
  "content": "The capital of France is Paris, located in the north of the country.",
  "reference": "Paris is the capital of France."
}

Project Patterns

How Evaluator Fits Into Projects

Evaluators serve two roles in a project. The first is as an attached modifier on an agent: with auto enabled, the evaluator runs after every invocation and flags low-quality outputs. The second is as a standalone element wired into a flow: the flow explicitly routes output through the evaluator and branches on the pass/fail result using a condition element.

In the development stage, evaluators run on every output. In demo and live stages, the sampling.rate property lets you evaluate a representative fraction rather than every invocation.

Example Project Spec

# LLM-judge evaluator for customer support responses
elements:
  - element_type: evaluator
    slug: support-evaluator
    spec:
      scoring_method: weighted
      llm_assisted:
        enabled: true
        model_ref: "claude-sonnet-4-6"
      criteria:
        - name: empathy
          weight: 0.3
          evaluator: rate_empathy
        - name: accuracy
          weight: 0.5
          evaluator: rate_accuracy
        - name: brevity
          weight: 0.2
          evaluator: rate_brevity
      thresholds:
        pass: 70
        excellent: 88
      output_schema:
        include_reasoning: true
        include_suggestions: true

Common Patterns

Auto-Evaluator Attached to an Agent

Attach an evaluator to an agent and turn on auto (off by default) so every task is scored automatically:

# Create the evaluator
POST /api/{circle}/{project}/

{
  "element_type": "evaluator",
  "slug": "task-quality-gate",
  "spec": {
    "auto": true,
    "scoring_method": "binary",
    "criteria": [
      {"name": "task_completed", "weight": 1.0}
    ],
    "thresholds": {"pass": 80}
  }
}

# Attach it to the agent
POST /api/{circle}/{project}/task-quality-gate/ops/attach

{ "target_id": "<agent-uuid>" }

When the agent completes a task, the evaluator runs automatically and the result is recorded against the task.

Sampling for Production Pipelines

For high-throughput flows in live stage, evaluate only a random 10% sample to reduce cost:

POST /api/{circle}/{project}/

{
  "element_type": "evaluator",
  "slug": "sampled-evaluator",
  "spec": {
    "scoring_method": "numeric",
    "criteria": [{"name": "quality", "weight": 1.0}],
    "sampling": {
      "rate": 0.1,
      "strategy": "random"
    },
    "storage": {
      "enabled": true,
      "retention_days": 90
    }
  }
}
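
To make the effect of a 0.1 sampling rate concrete, the sketch below simulates a per-output random draw. The page does not describe how the platform selects samples, so this is an assumed model of the random strategy, and should_evaluate is a hypothetical helper.

```python
import random


def should_evaluate(rate: float, rng: random.Random) -> bool:
    """Return True for roughly `rate` of calls (assumed 'random' strategy)."""
    # rng.random() is uniform on [0.0, 1.0), so rate=1.0 always evaluates
    # and rate=0.0 never does.
    return rng.random() < rate


rng = random.Random(42)  # fixed seed so the simulation is repeatable
total = 10_000
sampled = sum(should_evaluate(0.1, rng) for _ in range(total))
print(f"evaluated {sampled} of {total} outputs")
```

With a rate of 0.1 roughly one output in ten is scored, which is why averages computed from a sampled evaluator rest on far fewer data points than the invocation count suggests.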

Attaching Resources

Evaluators can use data resources to look up reference material or ground truth during scoring:

POST /api/{circle}/{project}/{resource-slug}/ops/attach

{ "target_id": "<evaluator-uuid>" }

| Resource | Use case |
|---|---|
| vector | Semantic similarity scoring against a reference corpus |
| document | Load rubrics or scoring guidelines |
| sql | Look up expected values from a database |
| prompt | Inject evaluation prompt template for LLM-as-judge mode |

Applying Modifiers

| Modifier | Use case |
|---|---|
| rate-limit | Prevent evaluation floods when sampling.strategy is all |
| auth-policy | Restrict who can view evaluation results |
| requirements | Enforce that required data resources are available before evaluating |

Common Mistakes

Missing required criteria and scoring_method. Both are required fields. An evaluator with no criteria has nothing to score against and will fail validation before it runs.

Weights that do not sum to 1.0 in weighted mode. In weighted scoring, weights should sum to 1.0 across all criteria. Values like 0.5, 0.3, 0.2 sum correctly. Weights that sum above 1.0 inflate scores; weights that sum below 1.0 suppress them.
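
A short worked example shows why the weight sum matters. It assumes the aggregate is a plain weighted sum of per-criterion scores on the 0-100 scale, which the page implies but does not spell out; weighted_score and the sample scores are illustrative.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Plain weighted sum of per-criterion scores (assumed aggregation)."""
    return sum(scores[name] * weight for name, weight in weights.items())


scores = {"accuracy": 80, "clarity": 90, "completeness": 70}

balanced = {"accuracy": 0.5, "clarity": 0.3, "completeness": 0.2}  # sums to 1.0
inflated = {"accuracy": 0.6, "clarity": 0.4, "completeness": 0.3}  # sums to 1.3
deflated = {"accuracy": 0.4, "clarity": 0.2, "completeness": 0.1}  # sums to 0.7

print(f"{weighted_score(scores, balanced):.1f}")  # 81.0
print(f"{weighted_score(scores, inflated):.1f}")  # 105.0, above the 0-100 range
print(f"{weighted_score(scores, deflated):.1f}")  # 57.0, below every input score
```

With the same per-criterion scores, only the balanced weights yield a result that sits inside the range of the inputs.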

Expecting the evaluator to halt a flow on failure. When used as an attached modifier, the evaluator records a score but does not automatically stop execution. To gate on scores, wire the evaluator into a flow explicitly and use a condition element to branch on the result.score value.

Using evaluator form (modifier) versus evaluator element type. Evaluator is both an element type and a valid entry in attaches: for agent. When you see evaluator in an attaches: list, it means this evaluator element is attached as a modifier, and the runtime calls it after the parent actor completes when auto is enabled.

Relationships

  • Attaches to: circle, rate-limit, auth-policy, app, automation, document, graph, sql, timeseries, vector, files, python, javascript, ruby, rust-fn, go-fn, csharp
  • Uses: prompt, variable, sql, document, vector

Properties

| Property | Type | Default | Description |
|---|---|---|---|
| handler | string | | Reference to the handler code/function for evaluation |
| criteria | array | | Evaluation criteria: each criterion is scored independently, then aggregated |
| scale | object | | Evaluation scale: defines the range for criterion scores. Scores are normalized to 0-100 internally. |
| mode | string | "auto" | Evaluation mode. 'code' forces the inline/handler code path (deterministic, fast). 'llm' forces LLM-assisted evaluation (flexible, uses rubrics). 'delegate' forces scoring via the referenced actor in spec.element.ref (an agent or function). 'auto' selects delegate if spec.element.ref is set, else LLM if llm_assisted.enabled is true, otherwise falls back to code. |
| element | object | | Delegate scoring to a referenced actor element (an agent or function) instead of inline code or the LLM judge. The referenced element is invoked with the content to evaluate, and its output is normalized into the standard scorecard. |
| auto | boolean | false | Auto-invoke this evaluator after every run of the element it is attached to. When true, the platform runs this evaluator against the target's output automatically (no explicit evaluate call) and emits an evaluator.auto_evaluate.completed event, the basis of the test-driven evaluator loop. Default off so attaching an evaluator never silently adds an evaluation to every production invocation. Honors sampling.rate. |
| scoring_method | string | "numeric" | Scoring methodology |
| thresholds | object | | Score thresholds for pass/fail (0-100) |
| code | object | | Custom evaluation code, inline or by reference |
| input_schema | object | | Schema for inputs to evaluate |
| output_schema | object | | Expected evaluation result schema |
| sampling | object | | Controls which outputs get evaluated; useful for high-volume actors |
| llm_assisted | object | | AI-assisted evaluation: uses an LLM to judge quality |
| alerts | object | | Alert configuration for failed evaluations |
| storage | object | | Evaluation result storage: stores scores and reasoning for analysis |

Operations

attach

POST /ops/attach | Auth: Read

Attach this modifier to a target element

Attaches this modifier to a target element. The target_id must be a UUID of an existing element that supports this modifier type (check applies_to in definition.yaml). Priority controls evaluation order when multiple modifiers of the same type are attached — lower priority runs first. The attachment is stored in element_modifiers table. Cascade resolution runs at bond-time to merge this modifier into the target’s resolved config. Common mistake: attaching to an incompatible element type — check topology rules first.

compare

GET /ops/compare | Auth: Read

Compare evaluation results over time

Compare score trends between two time windows. Set period to day, week (default), or month. Returns current_avg, previous_avg, change percentage, and trend (improving/stable/declining). Useful for tracking quality over time.
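
The fields returned by compare can be illustrated with a local calculation over two windows of scores. This is a sketch, not the platform's implementation: compare_windows is a hypothetical helper, and the stable_band cutoff (a change within plus or minus 2 percent counts as stable) is an assumption, since the page does not state where "stable" ends.

```python
def compare_windows(current: list[float], previous: list[float],
                    stable_band: float = 2.0) -> dict:
    """Compare average scores of two windows.

    stable_band is an assumed cutoff, in percent, for the 'stable' trend.
    """
    if not current or not previous:
        raise ValueError("both windows need at least one score")
    current_avg = sum(current) / len(current)
    previous_avg = sum(previous) / len(previous)
    # Percentage change relative to the previous window; 0 when undefined.
    change = 0.0 if previous_avg == 0 else (current_avg - previous_avg) / previous_avg * 100
    if change > stable_band:
        trend = "improving"
    elif change < -stable_band:
        trend = "declining"
    else:
        trend = "stable"
    return {
        "current_avg": current_avg,
        "previous_avg": previous_avg,
        "change": round(change, 2),
        "trend": trend,
    }


print(compare_windows(current=[80, 90], previous=[70, 80]))
```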

criteria

GET /ops/criteria | Auth: Read

Get configured evaluation criteria

Get the evaluator’s configured criteria definitions from spec. Returns criteria array with name, description, weight, threshold, and type for each criterion. Also returns the overall pass_threshold.

delete

DELETE /ops/delete | Auth: Admin

Delete element (soft delete)

Soft delete — sets state to ‘deleted’ but retains the record. Cannot delete elements that have children (has_no_bond precondition) or active runs. Requires admin auth and confirmation.

detach

POST /ops/detach | Auth: Read

Detach this modifier from a target element

Removes this modifier from a target element. Requires the target_id. Pervasive modifiers (audit, policy) can only be detached at the level they were originally attached — inherited pervasive modifiers cannot be detached by child elements. After detach, cascade resolution re-runs to remove this modifier’s effect from the resolved config.

disable

POST /ops/disable | Auth: Admin

Disable element (hides and prevents use)

Idempotent — safe to call on already-disabled elements. Optionally pass a reason string. Disabled elements cannot be invoked or executed. Inverse of enable.

enable

POST /ops/enable | Auth: Admin

Enable element (makes usable and visible)

Idempotent — safe to call on already-enabled elements. Transitions element to ready/enabled state. Cannot enable deleted elements. Inverse of disable.

evaluate

POST /ops/evaluate | Auth: Execute

Run evaluation against input data

Score content against criteria.

  • Content: pass data.response or data.output for inline content, or data.generation_id to evaluate an agent generation.
  • Criteria: taken from input.criteria (array or dict), falling back to spec.criteria. Array form: [{name, weight, threshold}]. Dict form: {name: weight} or {name: {weight, threshold, rubric}}.
  • Returns: evaluation_id and a scorecard with 0-100 scores.
  • Inline code (spec.code.source + spec.code.language): the function must be named handler(), main(), or entrypoint(). Return format: {score: N} or {overall_score: N, criteria: {name: score}}, where N is 0.0-1.0.
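
As an illustration of the inline code contract, here is a minimal Python handler that returns the {overall_score, criteria} shape with values between 0.0 and 1.0. The argument shape is an assumption: the page names the function and its return format but not what the function receives, so this sketch takes a dictionary carrying response or output. The two checks and the 50-word limit are made up for the example.

```python
def handler(data: dict) -> dict:
    """Score a response on two simple, deterministic checks.

    Assumption: `data` is a dict with the text under 'response' or 'output'.
    """
    response = (data.get("response") or data.get("output") or "").strip()
    word_count = len(response.split())

    criteria = {
        # 1.0 when there is any text at all.
        "non_empty": 1.0 if response else 0.0,
        # 1.0 when the text is present and at most 50 words (arbitrary limit).
        "concise": 1.0 if 0 < word_count <= 50 else 0.0,
    }
    overall = sum(criteria.values()) / len(criteria)
    return {"overall_score": overall, "criteria": criteria}


print(handler({"response": "Paris is the capital of France."}))
```

Because the return values are on the 0.0-1.0 scale, the resulting scorecard would show them scaled to 0-100.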

get

GET /ops/get | Auth: Read

Get element details

Element is already resolved by the routing layer — this returns the cached element, not a fresh DB query. Use the path /api/{circle}/{slug} to address elements.

get_attached_modifiers

GET /ops/attached/{target_id} | Auth: Read

Get all modifiers attached to a target element

Lists all modifiers attached to a specific target element, including modifier_id, type, subcategory, and priority. Useful for debugging cascade resolution or understanding which policies apply to an element before invoking it.

intention

GET /ops/intention | Auth: Read

Get element intention with full inheritance chain

Returns three levels: direct (this element’s intention), inherited (from category and root), and resolved (final merged intention). Useful for understanding an element’s purpose in context of its hierarchy.

list_attachments

GET /ops/targets | Auth: Read

List all elements this modifier is attached to

Returns all target elements where this modifier is currently applied. Shows target_id, target_type, priority, and cascade_policy.

readme_update

POST /ops/readme_update | Auth: Write

Update element README.md content

Creates or overwrites README.md in the element’s git repo. Commits to the draft branch. Content must be provided as a markdown string.

result_get

GET /ops/results/{evaluation_id} | Auth: Read

Get a specific evaluation result

Get full scorecard details including per-criterion scores, input data, and timestamps. Requires evaluation_id. Only returns results belonging to this evaluator element.

results

GET /ops/results | Auth: Read

List evaluation results (scorecards)

List historical evaluation scorecards. Filter by ?passed=true|false, ?since=, ?subject_element_id, ?subject_generation_id. Returns pass_rate across matching results. Paginate with ?limit (max 200) and ?offset.
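
The filters and pagination parameters above combine into an ordinary query string. The sketch below only builds the path and query; results_query is a hypothetical helper, the default limit of 50 is an assumption, and the value passed as since is forwarded untouched because the page does not specify its format. Clamping limit to 200 on the client side is a precaution, since the server's behaviour for larger values is not documented here.

```python
from urllib.parse import urlencode

MAX_LIMIT = 200  # documented maximum for ?limit


def results_query(passed=None, since=None, limit=50, offset=0) -> str:
    """Build the path and query string for the results operation."""
    params = {}
    if passed is not None:
        params["passed"] = "true" if passed else "false"
    if since is not None:
        params["since"] = since  # format not specified in the docs
    params["limit"] = min(max(limit, 1), MAX_LIMIT)
    params["offset"] = max(offset, 0)
    return "/ops/results?" + urlencode(params)


# Second page of failed evaluations, with an oversized limit clamped to 200.
print(results_query(passed=False, limit=500, offset=200))
```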

schema

GET /ops/schema | Auth: Read

Get element input/output schema (MCP tools/list compatible)

Returns type-level port schemas from the TypeRegistry — not instance-specific overrides. Includes direction (input/output), required flag, and JSON schema per port. Useful for understanding what data an element accepts and produces.

update

PATCH /ops/update | Auth: Write

Update element

Partial update — send only the fields you want to change. spec, name, and intention are all independently optional. spec MUST be a JSON object when present; deep-merged into the existing spec by default. Empty {"spec":{}} preserves existing spec content but still records a new version (no-op for content, not for version state). To clear/replace the entire spec wholesale send {"spec":{...},"deep":false}. List-typed spec fields use replace semantics (the patch list replaces the existing list, no array merging). Coordinates Git + DB writes. Slug cannot be changed after creation.
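
The merge semantics described for spec can be modelled locally to predict what a patch will produce. This is a sketch of the documented rules (objects deep-merged, lists replaced, "deep": false replacing the whole spec), not the platform's code; apply_spec_patch is a hypothetical helper.

```python
import copy


def apply_spec_patch(existing: dict, patch: dict, deep: bool = True) -> dict:
    """Return the spec that results from applying `patch` to `existing`."""
    if not deep:
        # deep=false: the patch replaces the spec wholesale.
        return copy.deepcopy(patch)
    merged = copy.deepcopy(existing)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged key by key.
            merged[key] = apply_spec_patch(merged[key], value)
        else:
            # Scalars and lists use replace semantics (no array merging).
            merged[key] = copy.deepcopy(value)
    return merged


existing = {
    "thresholds": {"pass": 70, "excellent": 90},
    "criteria": [{"name": "accuracy", "weight": 1.0}],
}
patch = {
    "thresholds": {"pass": 80},
    "criteria": [{"name": "clarity", "weight": 1.0}],
}
print(apply_spec_patch(existing, patch))
print(apply_spec_patch(existing, patch, deep=False))
```

In the deep-merged result thresholds.excellent survives while the criteria list is replaced outright, so a patch that adds one criterion has to resend the full list.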

Lifecycle / runtime

Defined for this element

Before invoke

  • validate_input
  • check_rate_limit

After invoke

  • record_metrics
  • emit_traces

On error

  • log_error
  • record_error_metric

Observability

Defined for this element

Metrics

  • evaluation_count
  • duration_ms
  • error_rate

Events

  • evaluator.completed
  • evaluator.failed

Pricing / cost

Platform default

Operation costs

  • create: free
  • update: free
  • delete: free
  • get: free
  • list: free
  • invoke: 10000 micro-AU
  • tool_use: free

Set it up

Criteria (string)
What to evaluate
Scoring Method (string)