◆ Autonomous Judge
Gates every agent release. It runs gold-set regression suites in CI, orchestrates LLM-as-judge and agent-as-judge scoring, fires adversarial red-team prompts, and runs champion/challenger bake-offs before any new prompt or model version reaches production. A drop on any safety-critical suite blocks the release.
Memory
Working: The candidate version under test and its running scorecard.
Episodic: Every prior eval run per agent, i.e. the full regression history.
Semantic: Gold sets, rubrics, red-team attack library, pass thresholds.
Procedural: Which eval suites matter for which agent class.
Store: Eval result warehouse + immutable run ledger.
Orchestration
orchestrator-worker · MCP · A2A
Harness · Managed Agents: orchestrator spawning parallel eval workers per suite; sandboxed code-exec for scoring; results persisted to an immutable eval ledger.
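The fan-out described above can be sketched as follows. This is a minimal illustration, not the harness itself: `run_suite`, `orchestrate`, the suite shape, and the toy `candidate` function are all hypothetical names, and a thread pool stands in for whatever worker mechanism the managed-agent harness actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(name, cases, candidate):
    """Worker: score one suite and return its pass rate (hypothetical shape)."""
    passed = sum(1 for prompt, expected in cases if candidate(prompt) == expected)
    return {"suite": name, "pass_rate": passed / len(cases), "n": len(cases)}

def orchestrate(suites, candidate, max_workers=4):
    """Orchestrator: submit one worker per suite, then collect every result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(run_suite, name, cases, candidate)
            for name, cases in suites.items()
        }
        return {name: future.result() for name, future in futures.items()}

# Toy stand-ins: the "candidate" upper-cases its input, and each suite is a
# list of (input, expected) pairs. Real suites would call the agent under test.
suites = {
    "regression": [("a", "A"), ("b", "B")],
    "red_team": [("x", "X"), ("y", "z")],
}
results = orchestrate(suites, candidate=str.upper)
```

Keeping each worker's output as a plain dictionary makes it straightforward to append the results to a ledger afterwards.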
Tools
- { } CI/CD pipeline · API
- ⌕ Gold-set + rubric store · Retrieval
- ⇄ Judge models (LLM-as-judge / agent-as-judge) · A2A
- ›_ Scoring + statistics sandbox · Code exec
- ›_ Red-team prompt generator · Code exec
Evals & guardrails
- The eval rig evals itself: judge calibration checked against the board-anchored ground-truth gold set.
- Champion/challenger required before any version promotion; no silent swaps.
- Regression gate is hard: a drop on any safety-critical suite blocks the release.
- Red-team suite refreshed continuously from new jailbreak/abuse patterns.
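One way to read the hard regression gate is as a pure function over per-suite scores. The sketch below assumes a hypothetical `SuiteResult` record and a `tolerance` parameter that defaults to zero; neither appears in the source, and the suite names and scores are invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SuiteResult:
    name: str
    safety_critical: bool
    champion_score: float    # score of the version currently in production
    challenger_score: float  # score of the candidate under test

def gate(results, tolerance=0.0):
    """Return (promote, blocking_suites).

    Any safety-critical suite where the challenger scores below the champion
    (beyond the tolerance) blocks promotion. Other suites never block.
    """
    blocking = [
        r.name
        for r in results
        if r.safety_critical and r.challenger_score < r.champion_score - tolerance
    ]
    return (not blocking, blocking)

# Illustrative scores only.
results = [
    SuiteResult("true_match_recall", True, champion_score=1.0, challenger_score=0.98),
    SuiteResult("false_positive_release", False, champion_score=0.70, challenger_score=0.85),
]
promote, blocking = gate(results)
```

Here the gain on the non-critical suite does not offset the drop on the safety-critical one, so `promote` is `False`.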
Offline reflection
Offline analysis of production failures (via the observability agent) folds new cases into the gold sets. The eval suite improves itself Reflexion/SEAL-style, also offline: every gold-set change is re-derived and gated by an independent judge agent.
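As a rough sketch of that folding step, assume each case is a dictionary with `input` and `expected` keys and that the independent judge is a callable returning `True` to approve a case. Both are assumptions made for illustration; the source does not specify either shape.

```python
def fold_failures(gold_set, failures, independent_judge):
    """Return a new gold set with approved, non-duplicate failures appended.

    The input gold set is left untouched, so the change can be reviewed
    before it replaces the current version.
    """
    seen = {case["input"] for case in gold_set}
    updated = list(gold_set)
    for case in failures:
        if case["input"] in seen:
            continue  # already covered by an existing gold case
        if independent_judge(case):
            updated.append(case)
            seen.add(case["input"])
    return updated

# Toy data: the stand-in judge rejects cases with no expected label.
gold = [{"input": "a", "expected": "block"}]
failures = [
    {"input": "a", "expected": "block"},  # duplicate of an existing case
    {"input": "b", "expected": "block"},  # new and labelled
    {"input": "c", "expected": ""},       # new but unlabelled
]
new_gold = fold_failures(gold, failures, independent_judge=lambda c: bool(c["expected"]))
```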
Frontier edge
- ▲ Formal action-gating: the regression gate is a provable release bound; a candidate that fails any safety-critical suite cannot be promoted, and the verdict is cryptographically signed and replayable.
- ▲ Self-improving fleet: production failures harvested fleet-wide are auto-distilled into new gold-set cases (Reflexion/SEAL-style), so the bar rises across every agent between releases.
- ▲ World-model simulation: bakes off challenger versions against held-out and synthetically perturbed cases to predict prod behaviour before any traffic touches them.
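To make the "signed and replayable" idea concrete, the sketch below tags a verdict with HMAC-SHA256 from Python's standard library. Note the substitution: an HMAC is a shared-key stand-in for a true digital signature, chosen only to keep the example dependency-free. The verdict fields and the key are hypothetical.

```python
import hashlib
import hmac
import json

def sign_verdict(verdict, key):
    """Serialise the verdict canonically and return a hex HMAC-SHA256 tag."""
    payload = json.dumps(verdict, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_verdict(verdict, signature, key):
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_verdict(verdict, key), signature)

# Illustrative values only; a real key would come from a secret store.
key = b"demo-key-not-for-production"
verdict = {
    "candidate": "prompt-v2",
    "promote": False,
    "blocking_suites": ["true_match_recall"],
}
signature = sign_verdict(verdict, key)

# Replaying the same verdict verifies; flipping the outcome does not.
tampered = dict(verdict, promote=True)
replay_ok = verify_verdict(verdict, signature, key)
tamper_ok = verify_verdict(tampered, signature, key)
```

Canonical serialisation (sorted keys, fixed separators) is what makes the check replayable: the same verdict always yields the same bytes.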
A sample run
Trigger: A new prompt version for the Sanctions Disposition Agent is opened as a PR.
1. Run the gold-set regression suite; spawn judge workers to score dispositions.
2. Fire the red-team battery (alias-evasion, prompt-injection in counterparty names).
3. Run champion/challenger against the live version on held-out true-match cases.
4. Compile the scorecard and post it back on the PR.
Output: PR blocked. The challenger improved false-positive release but missed one gold true-match, a hard fail on a safety-critical suite. Scorecard and the failing case attached for the author.
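The outcome of this run can be reproduced on toy data. Everything below is invented for illustration: the case IDs, the `escalate`/`release` disposition labels, and the `scorecard` helper are assumptions, not the agent's real schema.

```python
# Hypothetical held-out cases: (case_id, is_true_match).
cases = [("c1", True), ("c2", True), ("c3", False), ("c4", False), ("c5", False)]

# Hypothetical dispositions from each version.
champion = {"c1": "escalate", "c2": "escalate", "c3": "escalate", "c4": "escalate", "c5": "release"}
challenger = {"c1": "escalate", "c2": "release", "c3": "release", "c4": "release", "c5": "release"}

def scorecard(dispositions):
    """Summarise one version: false-positive release rate and missed true matches."""
    true_matches = [cid for cid, is_match in cases if is_match]
    false_positives = [cid for cid, is_match in cases if not is_match]
    missed = [cid for cid in true_matches if dispositions[cid] != "escalate"]
    released = sum(dispositions[cid] == "release" for cid in false_positives)
    return {
        "fp_release_rate": released / len(false_positives),
        "missed_true_matches": missed,
    }

champ = scorecard(champion)
chall = scorecard(challenger)

# Any missed true match is a hard fail, regardless of gains elsewhere.
blocked = len(chall["missed_true_matches"]) > 0
```

The challenger releases more false positives than the champion, which is the improvement, yet it also releases case `c2`, a true match, so the PR is blocked and `c2` is the failing case attached to the scorecard.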
In numbers
- Agents under regression eval: 100% of fleet
- Eval runs / day: 3,400
- Regressions caught pre-prod: 100%