Eval Harness Agent

◆ Autonomous Judge

Gates every agent release. It runs gold-set regression suites in CI, orchestrates LLM-as-judge and agent-as-judge scoring, fires adversarial red-team prompts, and runs champion/challenger bake-offs before any new prompt or model version reaches production. A drop on any safety-critical suite blocks the release.

Memory

Working: The candidate version under test and its running scorecard.

Episodic: Every prior eval run per agent (the full regression history).

Semantic: Gold sets, rubrics, red-team attack library, pass thresholds.

Procedural: Which eval suites matter for which agent class.

Store: Eval result warehouse + immutable run ledger.

Orchestration

orchestrator-worker · MCP · A2A

Harness · Managed Agents: orchestrator spawning parallel eval workers per suite; sandboxed code-exec for scoring; results persisted to an immutable eval ledger.
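
The orchestrator-worker pattern and the immutable ledger described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: `run_suite`, `EvalLedger` and `orchestrate` are hypothetical names, a thread pool stands in for sandboxed workers, and hash-chaining is one assumed way to make an append-only ledger tamper-evident.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor


def run_suite(suite_name, cases, candidate):
    """Hypothetical worker: score one suite by exact match against expected outputs."""
    passed = sum(1 for case in cases if candidate(case["input"]) == case["expected"])
    return {"suite": suite_name, "passed": passed, "total": len(cases)}


class EvalLedger:
    """Append-only ledger; each entry hashes its predecessor so edits are detectable."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, record):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        payload = json.dumps(record, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        self._entries.append({"record": record, "prev": prev_hash, "hash": entry_hash})

    def entries(self):
        return list(self._entries)

    def verify(self):
        """Recompute the whole chain; False means some entry was altered."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            payload = json.dumps(entry["record"], sort_keys=True)
            expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


def orchestrate(suites, candidate, ledger):
    """Spawn one worker per suite, then persist results in a stable (sorted) order."""
    with ThreadPoolExecutor(max_workers=max(1, len(suites))) as pool:
        futures = {
            name: pool.submit(run_suite, name, cases, candidate)
            for name, cases in suites.items()
        }
        results = [futures[name].result() for name in sorted(futures)]
    for result in results:
        ledger.append(result)
    return results
```

Writing to the ledger only after all workers finish keeps the entry order deterministic, which matters if a run is to be replayed and compared later.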

Tools

CI/CD pipeline · API
Gold-set + rubric store · Retrieval
Judge models (LLM-as-judge / agent-as-judge) · A2A
Scoring + statistics sandbox · Code exec
Red-team prompt generator · Code exec

Evals & guardrails

The eval rig evals itself: judge calibration checked against the board-anchored ground-truth gold set.
Champion/challenger required before any version promotion; no silent swaps.
Regression gate is hard: a drop on any safety-critical suite blocks the release.
Red-team suite refreshed continuously from new jailbreak/abuse patterns.
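
One way to check judge calibration against a ground-truth gold set is chance-corrected agreement. The sketch below uses Cohen's kappa; the card does not say which statistic the harness uses, so the choice of kappa, the function names and the 0.8 threshold are all assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(judge_labels, gold_labels):
    """Chance-corrected agreement between judge verdicts and gold labels."""
    if len(judge_labels) != len(gold_labels) or not gold_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold_labels)
    observed = sum(j == g for j, g in zip(judge_labels, gold_labels)) / n
    judge_freq = Counter(judge_labels)
    gold_freq = Counter(gold_labels)
    # Probability that both raters pick the same label by chance.
    expected = sum(judge_freq[label] * gold_freq[label] for label in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used a single identical label throughout
    return (observed - expected) / (1 - expected)


def judge_is_calibrated(judge_labels, gold_labels, min_kappa=0.8):
    """Assumed policy: a judge is usable only above a kappa threshold."""
    return cohen_kappa(judge_labels, gold_labels) >= min_kappa
```

Kappa is preferred here over raw agreement because a judge that always answers "pass" can score high raw agreement on a gold set dominated by passing cases.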

Offline reflection

Offline analysis of production failures (via the observability agent) folds new cases into the gold sets. The eval suite improves itself Reflexion/SEAL-style, always offline, and every gold-set change is re-derived and gated by an independent judge agent.
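
The folding step can be sketched as below, under stated assumptions: the failure record shape (`input`, `corrected_output`), the function name `fold_failures`, and the `independent_judge` callable are hypothetical stand-ins for whatever the observability agent and judge agent actually exchange.

```python
def fold_failures(gold_set, failures, independent_judge):
    """Return (new_gold_set, rejected_cases) without mutating the input gold set.

    A failure becomes a gold case only if it is not already covered and the
    independent judge accepts the re-derived expected output.
    """
    seen = {case["input"] for case in gold_set}
    accepted, rejected = [], []
    for failure in failures:
        if failure["input"] in seen:
            continue  # already covered by an existing or newly accepted case
        candidate_case = {
            "input": failure["input"],
            "expected": failure["corrected_output"],
        }
        if independent_judge(candidate_case):
            accepted.append(candidate_case)
            seen.add(candidate_case["input"])
        else:
            rejected.append(candidate_case)
    return gold_set + accepted, rejected
```

Returning a new list rather than editing the gold set in place keeps the previous version available, so earlier runs remain reproducible against the set they were scored on.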

Frontier edge

▲ Formal action-gating: the regression gate is a provable release bound; a candidate that fails any safety-critical suite cannot be promoted, and the verdict is cryptographically signed and replayable.
▲ Self-improving fleet: production failures harvested fleet-wide are auto-distilled into new gold-set cases (Reflexion/SEAL-style), so the bar rises across every agent between releases.
▲ World-model simulation: bakes off challenger versions against held-out and synthetically perturbed cases to predict prod behaviour before any traffic touches them.
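
A signed, replayable verdict could look something like the sketch below. It assumes an HMAC-SHA256 over a canonical JSON encoding; the card does not specify the signature scheme, and a production system might well prefer asymmetric signatures so verifiers never hold the signing key. The verdict fields shown are hypothetical.

```python
import hashlib
import hmac
import json


def sign_verdict(verdict, key):
    """Sign a canonical JSON encoding of the verdict with HMAC-SHA256."""
    payload = json.dumps(verdict, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_verdict(verdict, signature, key):
    """Constant-time check that the verdict has not changed since signing."""
    return hmac.compare_digest(sign_verdict(verdict, key), signature)


# Hypothetical verdict: pinning the candidate and gold-set versions is what
# makes the run replayable, since the same inputs can be re-scored later.
example_verdict = {
    "candidate_id": "prompt-v2",
    "gold_set_version": "gold-v1",
    "promote": False,
    "blocking_suites": ["gold_true_match"],
}
```

Sorting keys and fixing separators gives one byte sequence per verdict, so the same verdict always produces the same signature regardless of dictionary ordering.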

A sample run

Trigger: A new prompt version for the Sanctions Disposition Agent is opened as a PR.

1. Run the gold-set regression suite; spawn judge workers to score dispositions.
2. Fire the red-team battery (alias-evasion, prompt-injection in counterparty names).
3. Run champion/challenger against the live version on held-out true-match cases.
4. Compile the scorecard and post it back on the PR.

Output: PR blocked. The challenger improved false-positive release but missed one gold true-match, a hard fail on a safety-critical suite. Scorecard and the failing case are attached for the author.
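
The blocking decision in this sample run can be illustrated with a small gate function. This is a sketch under assumptions: the `SuiteResult` type, the suite names and every score are invented for illustration, and "drop" is read as the challenger scoring below the champion on a safety-critical suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    name: str
    safety_critical: bool
    champion_score: float
    challenger_score: float


def gate(results):
    """Block promotion if the challenger drops on any safety-critical suite."""
    blocking = [
        r.name
        for r in results
        if r.safety_critical and r.challenger_score < r.champion_score
    ]
    return {"promote": not blocking, "blocking_suites": blocking}


# Hypothetical scores mirroring the sample run: a gain on a non-critical suite
# cannot offset a single missed true-match on a safety-critical one.
sample_run = [
    SuiteResult("false_positive_release", False, 0.90, 0.94),
    SuiteResult("gold_true_match", True, 1.00, 0.99),
]
```

Calling `gate(sample_run)` yields a verdict with `promote` set to `False` and `gold_true_match` listed as the blocking suite; the gate deliberately ignores the size of the improvement elsewhere.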

In numbers

Agents under regression eval: 100% of fleet

Eval runs / day: 3,400

Regressions caught pre-prod: 100%

Handoffs

Fed by ← Fleet Observability Agent ← Code Review Agent

Hands to → Agent Registry & Protocol Agent