Verification & Quality¶

This page covers the quality-assurance pipeline attached to agent output: the verification stage that runs after an agent completes a task, the harness middleware that wraps every agent invocation, the review pipeline that validates produced artifacts, and the intake engine that ingests new work.

Verification Stage¶

Verification is a first-class stage in the workflow engine. Three research sources converge on the same design: Marco DeepResearch (verification-centric agent frameworks), GEMS (a five-stage agent loop with an explicit Verifier), and the Anthropic three-agent harness (Planner/Generator/Evaluator with calibrated grading). Each treats verification as a separate agent with its own context, not as a self-evaluation inside the generator step.

Workflow Node and Edge Types¶

WorkflowNodeType.VERIFICATION is a control-flow node like CONDITIONAL. Three dedicated edge types route verification outcomes:

VERIFICATION_PASS: artifact accepted
VERIFICATION_FAIL: artifact rejected, routed to regeneration
VERIFICATION_REFER: confidence below threshold, escalated to human review

Blueprint validation enforces exactly one of each edge type per verification node.
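The edge-count rule can be sketched as a small validator. This is a minimal illustration, not the engine's code: the enum name VerificationEdgeType and the function validate_verification_edges are hypothetical, and only the three edge names come from this page.

```python
from collections import Counter
from enum import Enum


class VerificationEdgeType(Enum):
    # Edge names are taken from this page; the enum itself is a hypothetical stand-in.
    VERIFICATION_PASS = "verification_pass"
    VERIFICATION_FAIL = "verification_fail"
    VERIFICATION_REFER = "verification_refer"


def validate_verification_edges(outgoing: list[VerificationEdgeType]) -> list[str]:
    """Return one error per edge type that does not occur exactly once."""
    counts = Counter(outgoing)  # missing edge types count as 0
    return [
        f"{edge.name}: expected exactly 1 outgoing edge, found {counts[edge]}"
        for edge in VerificationEdgeType
        if counts[edge] != 1
    ]
```

An empty list means the node is valid; a node with two PASS edges and no REFER edge yields two errors.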

Calibrated Rubric Grading¶

Each verification node references a VerificationRubric by name. A rubric contains:

Criteria (RubricCriterion): weighted dimensions with binary, ternary, or score grade types
Calibration examples: few-shot demonstrations for LLM graders
Minimum confidence: below this threshold, the verdict is overridden to REFER

Built-in rubrics: frontend-design (four criteria: design/originality/craft/functionality) and default-task (correctness/completeness/probe-adherence).
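A rough sketch of how a rubric could turn grades into a verdict follows. Only the REFER override below the minimum confidence is stated on this page; the weighted-average aggregation, the pass_threshold field, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    REFER = "refer"


@dataclass(frozen=True)
class RubricCriterion:
    name: str
    weight: float


@dataclass(frozen=True)
class VerificationRubric:
    name: str
    criteria: tuple[RubricCriterion, ...]
    min_confidence: float
    pass_threshold: float = 0.7  # hypothetical: this page does not define a pass mark


def decide(rubric: VerificationRubric, grades: dict[str, float], confidence: float) -> Verdict:
    """Map per-criterion grades in [0, 1] plus grader confidence to a verdict."""
    # Documented rule: low confidence overrides any verdict to REFER.
    if confidence < rubric.min_confidence:
        return Verdict.REFER
    # Assumed aggregation: weight-normalised average of the grades.
    total_weight = sum(c.weight for c in rubric.criteria)
    score = sum(c.weight * grades[c.name] for c in rubric.criteria) / total_weight
    return Verdict.PASS if score >= rubric.pass_threshold else Verdict.FAIL


default_task = VerificationRubric(
    name="default-task",
    criteria=(
        RubricCriterion("correctness", 2),       # weights are illustrative
        RubricCriterion("completeness", 1),
        RubricCriterion("probe-adherence", 1),
    ),
    min_confidence=0.6,
)
```

With these illustrative weights, grades of 1, 1, 0 score 0.75 and pass, but the same grades with a confidence of 0.4 are referred to a human.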

Atomic Criteria Decomposition¶

Acceptance criteria are decomposed into atomic binary probes (AtomicProbe) via a pluggable CriteriaDecomposer protocol. The default LLMCriteriaDecomposer uses the medium-tier provider. An IdentityCriteriaDecomposer maps each criterion to one probe for deterministic testing.
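The pluggable decomposer can be pictured as a typing.Protocol with an identity implementation. The method name decompose, the AtomicProbe fields, and the synchronous signature are assumptions; the real protocol may well be async and carry more data.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class AtomicProbe:
    # Field names are hypothetical; the page only states that probes are atomic and binary.
    probe_id: str
    question: str
    source_criterion: str


class CriteriaDecomposer(Protocol):
    def decompose(self, criteria: list[str]) -> list[AtomicProbe]:
        """Split acceptance criteria into yes/no probes."""
        ...


class IdentityCriteriaDecomposer:
    """Maps each criterion to exactly one probe, so output is deterministic."""

    def decompose(self, criteria: list[str]) -> list[AtomicProbe]:
        return [
            AtomicProbe(probe_id=f"probe-{i}", question=text, source_criterion=text)
            for i, text in enumerate(criteria, start=1)
        ]
```

Because the identity variant never calls a model, tests that depend on the probe list stay reproducible.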

Structured Handoff Artifacts¶

HandoffArtifact carries the payload, artifact references, probes, and optional rubric between stages. A model validator rejects self-handoff (from_agent_id == to_agent_id). Immutability is enforced by the frozen Pydantic model (frozen=True).
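The two guarantees, immutability and self-handoff rejection, can be sketched as follows. To keep the example dependency-free, a frozen stdlib dataclass stands in for the frozen Pydantic model, and __post_init__ stands in for the model validator; field names other than from_agent_id and to_agent_id are assumptions.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)  # stand-in for a Pydantic model configured with frozen=True
class HandoffArtifact:
    from_agent_id: str
    to_agent_id: str
    payload: str
    artifact_refs: tuple[str, ...] = ()   # hypothetical field name
    probes: tuple[str, ...] = ()          # hypothetical field name
    rubric: Optional[str] = None          # rubric is optional per this page

    def __post_init__(self) -> None:
        # Plays the role of the model validator: reject handoff to oneself.
        if self.from_agent_id == self.to_agent_id:
            raise ValueError("self-handoff rejected: from_agent_id == to_agent_id")


artifact = HandoffArtifact(from_agent_id="generator-1", to_agent_id="verifier-1", payload="draft")
```

Assigning to artifact.payload after construction raises FrozenInstanceError, and building an artifact whose two agent IDs match raises ValueError.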

Self-Evaluation Rejection¶

Self-evaluation (where the generator also judges its own output) is explicitly rejected. Prior research documents that self-evaluation produces over-confidence and fails to catch the generator's own blind spots. VerificationResult.evaluator_agent_id MUST differ from the generator agent ID; a model validator enforces this at construction.

Pluggable Grading¶

The RubricGrader protocol follows the standard protocol + strategy + factory + config discriminator pattern (mirroring engine/classification/). Variants: LLM (production) and HEURISTIC (testing/fallback). Configuration via VerificationConfig.
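The protocol + strategy + factory + config discriminator pattern might look roughly like the sketch below. The grade signature, the GraderKind enum, the build_grader factory, and the toy heuristic are all hypothetical; only the names RubricGrader, VerificationConfig, LLM, and HEURISTIC come from this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Protocol


class GraderKind(Enum):
    LLM = "llm"
    HEURISTIC = "heuristic"


@dataclass(frozen=True)
class VerificationConfig:
    grader: GraderKind = GraderKind.HEURISTIC  # the config discriminator


class RubricGrader(Protocol):
    def grade(self, artifact: str, probes: list[str]) -> dict[str, bool]: ...


class HeuristicRubricGrader:
    """Toy strategy: a probe passes when its text occurs in the artifact."""

    def grade(self, artifact: str, probes: list[str]) -> dict[str, bool]:
        lowered = artifact.lower()
        return {probe: probe.lower() in lowered for probe in probes}


class LLMRubricGrader:
    """Placeholder: a real variant would call a model provider."""

    def grade(self, artifact: str, probes: list[str]) -> dict[str, bool]:
        raise NotImplementedError("requires a model provider; out of scope here")


_GRADERS: dict[GraderKind, Callable[[], RubricGrader]] = {
    GraderKind.LLM: LLMRubricGrader,
    GraderKind.HEURISTIC: HeuristicRubricGrader,
}


def build_grader(config: VerificationConfig) -> RubricGrader:
    """Factory: pick the strategy named by the config discriminator."""
    return _GRADERS[config.grader]()
```

Callers depend only on the protocol, so swapping the production grader for the test grader is a configuration change.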

Harness Middleware Layer¶

The engine uses a composable middleware layer for cross-cutting concerns that span agent execution and multi-agent coordination. Two separate protocols serve two distinct pipelines.

Agent Middleware¶

Protocol: AgentMiddleware (engine/middleware/protocol.py). Six async hooks in declared order:

| Hook | Runs | Purpose |
| --- | --- | --- |
| `before_agent` | Once on invocation | Load memory, validate input, record hashes |
| `before_model` | Before each model call | Trim history, redact PII, inject context |
| `wrap_model_call` | Around model call | Caching, dynamic tools, model swap |
| `wrap_tool_call` | Around tool execution | Inject context, gate tools |
| `after_model` | After model responds | Human-in-loop, assumption-violation checks |
| `after_agent` | Once on completion | Save results, notify, cleanup |

Composition: before_* left-to-right, after_* right-to-left, wrap_* onion-style (each wraps the next). Exceptions propagate to the classification pipeline.
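The composition order can be demonstrated with a tracing middleware. This is a sketch under assumptions: the hook signatures, TracingMiddleware, and run_model_step are hypothetical, and only three of the six hooks are shown.

```python
import asyncio
from typing import Awaitable, Callable, List

ModelCall = Callable[[str], Awaitable[str]]


class TracingMiddleware:
    """Hypothetical middleware that only records when each hook fires."""

    def __init__(self, name: str, trace: List[str]) -> None:
        self.name = name
        self.trace = trace

    async def before_model(self, prompt: str) -> None:
        self.trace.append(f"{self.name}.before_model")

    async def wrap_model_call(self, prompt: str, call_next: ModelCall) -> str:
        self.trace.append(f"{self.name}.wrap_model_call:enter")
        response = await call_next(prompt)
        self.trace.append(f"{self.name}.wrap_model_call:exit")
        return response

    async def after_model(self, response: str) -> None:
        self.trace.append(f"{self.name}.after_model")


def _wrap(mw: TracingMiddleware, inner: ModelCall) -> ModelCall:
    async def wrapped(prompt: str) -> str:
        return await mw.wrap_model_call(prompt, inner)
    return wrapped


async def run_model_step(chain: List[TracingMiddleware], prompt: str, model: ModelCall) -> str:
    for mw in chain:                 # before_*: left-to-right
        await mw.before_model(prompt)
    call = model
    for mw in reversed(chain):       # wrap_*: build inside-out so chain[0] is outermost
        call = _wrap(mw, call)
    response = await call(prompt)
    for mw in reversed(chain):       # after_*: right-to-left
        await mw.after_model(response)
    return response


def demo() -> List[str]:
    trace: List[str] = []

    async def fake_model(prompt: str) -> str:
        trace.append("model")
        return prompt.upper()

    chain = [TracingMiddleware("a", trace), TracingMiddleware("b", trace)]
    asyncio.run(run_model_step(chain, "hi", fake_model))
    return trace
```

For the chain [a, b], demo() records a before b on the way in, a wrapping b around the model call, and b before a on the way out.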

Default chain: checkpoint_resume, delegation_chain_hash, authority_deference, sanitize_message, security_interceptor, policy_gate, approval_gate, assumption_violation, classification, cost_recording.

Optional middleware (registered in _AGENT_OPT_IN, must be enabled explicitly):

SemanticDriftDetector (after_model slot): compares model output against task acceptance criteria using cosine similarity. Opt-in via CompanyConfig.security.semantic_drift_enabled. Fail-soft: logs warnings but never blocks.
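A fail-soft cosine-similarity check could be sketched as below. Term-frequency vectors stand in for whatever embedding the real detector uses, and the threshold value, logger name, and function names are assumptions.

```python
import logging
import math
from collections import Counter
from typing import Optional

logger = logging.getLogger("semantic_drift")  # hypothetical logger name


def _term_vector(text: str) -> Counter:
    # Assumption: bag-of-words counts as a stand-in for real embeddings.
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def check_drift(model_output: str, acceptance_criteria: str, threshold: float = 0.2) -> Optional[float]:
    """Log a warning on low similarity or on failure; never raise, never block."""
    try:
        score = cosine_similarity(_term_vector(model_output), _term_vector(acceptance_criteria))
    except Exception:
        logger.warning("semantic drift check failed; continuing", exc_info=True)
        return None
    if score < threshold:
        logger.warning("possible semantic drift: similarity %.2f below %.2f", score, threshold)
    return score
```

The caller receives the score for observability only; nothing in the return value is meant to stop the agent.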

Coordination Middleware¶

Protocol: CoordinationMiddleware (engine/middleware/coordination_protocol.py). Five async hooks:

| Hook | Pipeline Position | Purpose |
| --- | --- | --- |
| `before_decompose` | Before Phase 1 | Clarification gate |
| `after_decompose` | After Phase 1 | Post-decomposition analysis |
| `before_dispatch` | Before Phase 3-5 | Plan review gate, task ledger |
| `after_rollup` | After Phase 6 | Progress ledger, replan hook |
| `before_update_parent` | Before Phase 7 | Authority deference scan |

Default chain: clarification_gate, task_ledger, plan_review_gate, progress_ledger, coordination_replan, authority_deference_coordination.

S1 Constraint Hooks¶

| Middleware | Hook | Behaviour |
| --- | --- | --- |
| `AuthorityDeferenceGuard` | `before_agent` | Detects authority cues in transcripts, logs patterns, injects justification header |
| `AssumptionViolationMiddleware` | `after_model` | Detects broken assumptions, emits escalation events |
| `ClarificationGateMiddleware` | `before_decompose` | Validates acceptance criteria specificity |
| `DelegationChainHashMiddleware` | `before_agent` | Records SHA-256 content hash for delegation drift detection |
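Delegation drift detection with a SHA-256 content hash can be illustrated as follows. What exactly gets hashed is not stated on this page; canonical JSON of a task payload is an assumption, as are the function names.

```python
import hashlib
import json


def content_hash(task_payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(task_payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_drifted(recorded_hash: str, current_payload: dict) -> bool:
    """True when the payload no longer matches the hash recorded at delegation time."""
    return content_hash(current_payload) != recorded_hash


# Record the hash when the task is delegated...
original = {"title": "Add login form", "criteria": ["email field", "password field"]}
recorded = content_hash(original)

# ...and compare later, after the task has passed through further delegates.
reworded = {"title": "Add signup form", "criteria": ["email field", "password field"]}
```

Here has_drifted(recorded, original) is False, while has_drifted(recorded, reworded) is True because the title changed along the chain.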

Configuration¶

Per-company: CompanyConfig.middleware (MiddlewareConfig) with agent and coordination sub-configs.

Per-task: Task.middleware_override replaces the company-level chain when set.
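The replace-not-merge semantics can be sketched in a few lines. The shape of middleware_override (a tuple of middleware names, with None meaning "not set") and the helper resolve_agent_chain are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    task_id: str
    # Assumption: None means "no override"; any other value counts as set.
    middleware_override: Optional[tuple[str, ...]] = None


def resolve_agent_chain(company_chain: tuple[str, ...], task: Task) -> tuple[str, ...]:
    """Return the chain to run: the override replaces the company chain wholesale."""
    if task.middleware_override is not None:
        return task.middleware_override
    return company_chain


company_chain = ("checkpoint_resume", "sanitize_message", "cost_recording")
plain_task = Task(task_id="t-1")
custom_task = Task(task_id="t-2", middleware_override=("sanitize_message",))
```

Under this reading, custom_task runs only sanitize_message; the company-level entries are not appended to the override.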

Error Semantics¶

Middleware exceptions propagate to the classification pipeline. ClassificationResult.action decides: retry, escalate, or fail. No silent swallowing.

Review Pipeline¶

The review pipeline provides a configurable chain of review stages for tasks in IN_REVIEW status. See the Client Simulation design page for the full architecture, including ReviewStage protocol, pipeline execution semantics, and metadata tracking.

Key design decisions:

No new TaskStatus values for pipeline tracking; tasks stay IN_REVIEW throughout, with progress tracked in task metadata.
Short-circuit on FAIL: first failing stage sends the task back to IN_PROGRESS for rework with the stage name and reason in metadata.
Default fallback: when no pipeline is configured, the existing ReviewGateService single-stage behaviour runs.
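The short-circuit behaviour can be sketched as a loop over stages. The Task and StageResult shapes, the metadata keys, and the (name, check) stage representation are hypothetical; the later red-team and vision gates are ignored here.

```python
from dataclasses import dataclass, field
from typing import Callable, NamedTuple


class StageResult(NamedTuple):
    passed: bool
    reason: str = ""


@dataclass
class Task:
    status: str = "IN_REVIEW"
    metadata: dict = field(default_factory=dict)


# Hypothetical stage representation: (stage name, check function).
ReviewStage = tuple[str, Callable[[Task], StageResult]]


def run_review_pipeline(task: Task, stages: list[ReviewStage]) -> Task:
    passed: list[str] = []
    for name, check in stages:
        result = check(task)
        if not result.passed:
            # Short-circuit: send back for rework, record which stage failed and why.
            task.status = "IN_PROGRESS"
            task.metadata["review_failed_stage"] = name
            task.metadata["review_failure_reason"] = result.reason
            return task
        passed.append(name)
        # Progress lives in metadata; the status stays IN_REVIEW while stages run.
        task.metadata["review_stages_passed"] = list(passed)
    task.status = "COMPLETED"
    return task
```

Stages after the first failure never run, which keeps expensive reviewers from evaluating work that is already headed back for rework.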

Intake Engine¶

The intake engine processes ClientRequest submissions through an independent state machine (RequestStatus) before creating tasks in the task engine. The real work-entry path (POST /requests/{id}/approve) approves a request and runs it through the IntakeEntryAdapter into the work pipeline spine so an agent executes it; the terminal state lands asynchronously. See Client Simulation for the full request lifecycle, intake strategy contracts, and the real work-entry path.

Vision Verifier Gate¶

The vision verifier is the UI cousin of the adversarial red-team gate: where the red-team gate attacks a text deliverable, the vision gate judges whether a running GUI deliverable matches its brief. It is opt-in (CompanyConfig.security.vision_verify.enabled, off by default) and fires after the red-team gate, before the IN_REVIEW -> COMPLETED transition.

A pluggable VisionVerifier (security/visionverify/) follows the standard protocol + strategy + factory + config discriminator pattern:

noop (default): inert; returns a clean report.
heuristic: deterministic, no LLM. Checks structured VisualExpectation entries (e.g. dominant colour) against the captured screenshots. Used by the acceptance test so a brief-mismatch BLOCK is reproducible.
llm_vision: sends the screenshots (as multimodal image_parts) plus the fenced brief to a vision-capable model and parses a structured verdict from a tool call. Gated on ModelCapabilities.supports_vision.

The VisionVerifierGate maps the report's findings to a verdict (PASS / PASS_WITH_FINDINGS / BLOCK) via the same severity x autonomy routing matrix as the red-team gate. Self-evaluation is rejected (the verifier identity must differ from the deliverable's generator). A verifier fault fails OPEN (a synthetic INFO finding) so a fault never blocks completion. SEC-1: the untrusted brief / criteria are wrapped with wrap_untrusted before reaching the model; screenshot bytes travel as structured image_parts, not as prompt text, and are elided from the cassette's human-readable copy.
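The fail-open rule can be sketched as a thin wrapper around the verifier call. The Finding and VisionReport shapes and the run_vision_verifier helper are hypothetical, and raising ValueError is only one possible way to reject self-evaluation; the severity x autonomy routing matrix is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    severity: str
    message: str


@dataclass(frozen=True)
class VisionReport:
    findings: tuple[Finding, ...] = ()  # an empty tuple is a clean report


def run_vision_verifier(
    verify: Callable[[], VisionReport],
    verifier_id: str,
    generator_id: str,
) -> VisionReport:
    # Self-evaluation check sits outside the try block so it is never swallowed.
    if verifier_id == generator_id:
        raise ValueError("self-evaluation rejected: verifier must differ from generator")
    try:
        return verify()
    except Exception as exc:
        # Fail OPEN: a verifier fault becomes a synthetic INFO finding, not a block.
        return VisionReport(findings=(Finding("INFO", f"vision verifier fault: {exc!r}"),))


def noop_verifier() -> VisionReport:
    return VisionReport()


def broken_verifier() -> VisionReport:
    raise RuntimeError("screenshot capture timed out")
```

A broken verifier therefore yields a report with a single INFO finding, which leaves a trace of the fault without giving it the power to block completion.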

Order of Operations¶

Five quality and approval surfaces (mid-execution AUTH_REQUIRED park, verification stage, post-completion review pipeline, adversarial red-team gate, vision verifier gate) operate at distinct points in the task lifecycle.

| Phase | Surface | Trigger | Task status during | Exit | Where documented |
| --- | --- | --- | --- | --- | --- |
| Mid-execution | `AUTH_REQUIRED` park | Agent calls a tool that requires approval at runtime (e.g. `deploy`, `db:admin`). Driven by `ApprovalGate` middleware. | `AUTH_REQUIRED` | Approved: returns to `ASSIGNED`. Denied / timeout: `CANCELLED`. | Security: Approval Workflow |
| Agent done | Verification stage | Workflow blueprint has a `VERIFICATION` control-flow node. Runs as a separate evaluator agent with its own context. | `IN_PROGRESS` (engine-internal) | Pass: continue to next node. Fail: regenerate. Refer: hand to human via `VERIFICATION_REFER` edge. | This page, Workflow Node and Edge Types |
| Agent done | Review pipeline | Task transitions `IN_PROGRESS` to `IN_REVIEW`. Chain of `ReviewStage` instances runs. | `IN_REVIEW` | First-failing stage returns the task to `IN_PROGRESS`; all-pass moves to `COMPLETED`. | This page, Review Pipeline |
| Review pipeline PASS | Red-team gate | Opt-in (`CompanyConfig.security.red_team.enabled`). Fires when the review pipeline returns its COMPLETED verdict, BEFORE the task-engine transition lands. | `IN_REVIEW` | BLOCK: routes back to `IN_PROGRESS` with the red-team summary as the rework reason. PASS / PASS_WITH_FINDINGS: pipeline's verdict stands. | Security: Adversarial Red-Team Gate |
| Red-team gate PASS | Vision verifier gate | Opt-in (`CompanyConfig.security.vision_verify.enabled`). The UI cousin of the red-team gate: fires after the red-team gate for GUI deliverables that carry screenshots (`vision_input`). Pluggable `VisionVerifier` (`noop` / `heuristic` / `llm_vision`) judges whether the running app matches the brief. | `IN_REVIEW` | BLOCK: routes back to `IN_PROGRESS` with the vision summary as the rework reason. PASS / PASS_WITH_FINDINGS: prior verdict stands. Absent screenshots: SKIP (non-GUI deliverable). | This page, Vision Verifier Gate |

Key invariants:

AUTH_REQUIRED is the mid-execution park reason and uses the ApprovalGate middleware in the agent harness. The review pipeline is the post-completion quality gate and uses ReviewGateService. The two are independent: a single task can encounter both (e.g. pause for deploy approval mid-task, then enter IN_REVIEW once the agent finishes).
The verification stage runs BEFORE the review pipeline when both are configured for the same workflow. Verification is a workflow blueprint construct (a node in the graph); the review pipeline fires on the IN_PROGRESS to IN_REVIEW transition that happens after the workflow's last node completes.
The review pipeline does not mint new TaskStatus values; the task stays at IN_REVIEW throughout, with stage progress in metadata.