The Agentic Bank

Eval Harness Agent

⬡ Crucible: runs the gold-set, judge, and red-team evals that gate every agent release.
◆ Autonomous Judge

Crucible gates every agent release. It runs gold-set regression suites in CI, orchestrates LLM-as-judge and agent-as-judge scoring, fires adversarial red-team prompts, and runs champion/challenger bake-offs before any new prompt or model version reaches production. A drop on any safety-critical suite blocks the release.
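
The blocking rule can be sketched in a few lines. This is a minimal illustration, not the production gate: `SuiteResult`, `gate_release`, and the `tolerance` parameter are hypothetical names, and the scores are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """One suite's outcome for a candidate (hypothetical structure)."""
    name: str
    safety_critical: bool
    champion_score: float   # score of the version currently in production
    candidate_score: float  # score of the version under test


def gate_release(results, tolerance=0.0):
    """Return (allowed, blocking_suites).

    A candidate is blocked if it scores below the champion on any
    safety-critical suite. Non-critical suites never block here.
    """
    blocking = [
        r.name
        for r in results
        if r.safety_critical and r.candidate_score < r.champion_score - tolerance
    ]
    return (not blocking, blocking)


# Illustrative scores only.
results = [
    SuiteResult("sanctions-true-match", True, 1.00, 0.99),
    SuiteResult("tone-and-style", False, 0.91, 0.88),
]
allowed, blocking = gate_release(results)
print(allowed, blocking)  # False ['sanctions-true-match']
```

The default `tolerance` of zero mirrors the "any drop blocks" wording; a real gate would probably also need to account for scoring noise, which this sketch ignores.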

Memory

Working: The candidate version under test and its running scorecard.
Episodic: Every prior eval run per agent: the full regression history.
Semantic: Gold sets, rubrics, red-team attack library, pass thresholds.
Procedural: Which eval suites matter for which agent class.
Store: Eval result warehouse + immutable run ledger
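
One common way to make a run ledger tamper-evident is to hash-chain its entries. The sketch below assumes that approach; the page does not say how the ledger is implemented, and `RunLedger` with its methods is a hypothetical name.

```python
import hashlib
import json


class RunLedger:
    """Append-only, hash-chained ledger (illustrative assumption)."""

    GENESIS = "0" * 64  # placeholder hash preceding the first entry

    def __init__(self):
        self._entries = []

    def append(self, record: dict) -> str:
        """Store a record linked to the previous entry's hash."""
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev_hash, "payload": payload, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks every later hash."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev_hash + entry["payload"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = RunLedger()
ledger.append({"agent": "sanctions-disposition", "suite": "gold-set", "passed": True})
ledger.append({"agent": "sanctions-disposition", "suite": "red-team", "passed": True})
print(ledger.verify())  # True
```

An in-memory list only detects tampering; it does not prevent it. Real immutability would depend on the storage layer.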

Orchestration

orchestrator-worker · MCP · A2A

Harness · Managed Agents: orchestrator spawning parallel eval workers per suite; sandboxed code-exec for scoring; results persisted to an immutable eval ledger.
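
The orchestrator-worker pattern described above can be approximated with a thread pool: one worker per suite, results gathered by the orchestrator. This is a sketch under assumptions, with threads standing in for managed agent workers; `run_suite`, `orchestrate`, and the toy candidate are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suite(suite_name, cases, candidate):
    """Worker: score one suite by exact match against expected outputs."""
    passed = sum(1 for prompt, expected in cases if candidate(prompt) == expected)
    return {"suite": suite_name, "passed": passed, "total": len(cases)}


def orchestrate(suites, candidate, max_workers=4):
    """Orchestrator: fan out one worker per suite, then collect results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(run_suite, name, cases, candidate)
            for name, cases in suites.items()
        }
        return {name: future.result() for name, future in futures.items()}


def toy_candidate(prompt):
    """Toy stand-in for the agent version under test."""
    return prompt.upper()


suites = {
    "gold-set": [("ok", "OK"), ("hold", "HOLD")],
    "red-team": [("ignore rules", "REFUSED")],
}
scorecard = orchestrate(suites, toy_candidate)
print(scorecard["gold-set"])  # {'suite': 'gold-set', 'passed': 2, 'total': 2}
```

Exact-match scoring is a simplification; the judge-model scoring the page describes would replace the comparison inside `run_suite`.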

Tools

  • CI/CD pipeline (API)
  • Gold-set + rubric store (Retrieval)
  • Judge models: LLM-as-judge / agent-as-judge (A2A)
  • Scoring + statistics sandbox (Code exec)
  • Red-team prompt generator (Code exec)

Evals & guardrails

  • The eval rig evals itself: judge calibration checked against the board-anchored ground-truth gold set.
  • Champion/challenger required before any version promotion; no silent swaps.
  • Regression gate is hard: a drop on any safety-critical suite blocks the release.
  • Red-team suite refreshed continuously from new jailbreak/abuse patterns.
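
Judge calibration, the first guardrail above, is often measured as chance-corrected agreement between the judge's labels and the ground-truth labels. The sketch below uses Cohen's kappa as one plausible metric; the page does not name the statistic, and the labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(gold, judged):
    """Chance-corrected agreement between gold labels and judge labels."""
    if len(gold) != len(judged) or not gold:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(gold)
    observed = sum(g == j for g, j in zip(gold, judged)) / n
    gold_counts, judge_counts = Counter(gold), Counter(judged)
    # Agreement expected if both raters labelled independently at their base rates.
    expected = sum(gold_counts[label] * judge_counts[label] for label in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides used a single identical label throughout
    return (observed - expected) / (1 - expected)


gold = ["pass", "pass", "fail", "fail"]
judged = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(gold, judged))  # 0.5
```

A calibration check would compare this value against a pass threshold from the rubric store; the threshold itself is not given on this page.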

Offline reflection

Offline analysis of production failures (via the observability agent) folds new cases into the gold sets. The eval suite improves itself Reflexion/SEAL-style, always offline, and every gold-set change is re-derived and gated by an independent judge agent.
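
The folding step can be sketched as a filter: each production failure becomes a proposed gold case, and only cases an independent judge approves are added. Everything here is an assumption made for illustration, including the field names (`input`, `corrected_output`) and the toy judge.

```python
def propose_gold_cases(failures, gold_set, independent_judge):
    """Turn production failures into gold cases, gated by a separate judge.

    Returns (accepted, rejected). Inputs already in the gold set are skipped.
    """
    existing = {case["input"] for case in gold_set}
    accepted, rejected = [], []
    for failure in failures:
        candidate = {"input": failure["input"], "expected": failure["corrected_output"]}
        if candidate["input"] in existing:
            continue  # duplicate of an existing gold case
        if independent_judge(candidate):
            accepted.append(candidate)
            existing.add(candidate["input"])
        else:
            rejected.append(candidate)
    return accepted, rejected


def toy_judge(case):
    """Toy stand-in for the independent judge: needs a non-empty expected output."""
    return bool(case["expected"])


gold_set = [{"input": "known alias", "expected": "escalate"}]
failures = [
    {"input": "known alias", "corrected_output": "escalate"},
    {"input": "new alias spelling", "corrected_output": "escalate"},
    {"input": "ambiguous case", "corrected_output": ""},
]
accepted, rejected = propose_gold_cases(failures, gold_set, toy_judge)
print(len(accepted), len(rejected))  # 1 1
```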

Frontier edge

  • Formal action-gating: the regression gate is a provable release bound; a candidate that fails any safety-critical suite cannot be promoted, and the verdict is cryptographically signed and replayable.
  • Self-improving fleet: production failures harvested fleet-wide are auto-distilled into new gold-set cases (Reflexion/SEAL-style), so the bar rises across every agent between releases.
  • World-model simulation: bakes off challenger versions against held-out and synthetically perturbed cases to predict prod behaviour before any traffic touches them.

A sample run

Trigger: A new prompt version for the Sanctions Disposition Agent is opened as a PR.
  1. Run the gold-set regression suite; spawn judge workers to score dispositions.
  2. Fire the red-team battery (alias-evasion, prompt-injection in counterparty names).
  3. Run champion/challenger against the live version on held-out true-match cases.
  4. Compile the scorecard and post it back on the PR.
Output: PR blocked: the challenger improved false-positive release but missed one gold true-match, a hard fail on a safety-critical suite. Scorecard and the failing case attached for the author.
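
The outcome of this run can be reproduced with a toy bake-off. The sketch assumes each version is a threshold on a name-similarity score, which is a stand-in for the real agent; the case data, thresholds, and function names are all invented for illustration.

```python
def score(decide, cases):
    """Collect missed true matches and false positives for one version."""
    missed = [c["id"] for c in cases if c["true_match"] and decide(c) != "escalate"]
    false_positives = [c["id"] for c in cases if not c["true_match"] and decide(c) == "escalate"]
    return {"missed_true_matches": missed, "false_positives": false_positives}


def bake_off(champion, challenger, cases):
    """Compare both versions; any missed true match blocks the challenger."""
    champ, chall = score(champion, cases), score(challenger, cases)
    return {
        "champion": champ,
        "challenger": chall,
        "blocked": len(chall["missed_true_matches"]) > 0,
    }


def make_version(threshold):
    """Toy agent version: escalate when similarity meets the threshold."""
    return lambda case: "escalate" if case["similarity"] >= threshold else "release"


cases = [
    {"id": "c1", "true_match": True, "similarity": 0.9},
    {"id": "c2", "true_match": True, "similarity": 0.7},
    {"id": "c3", "true_match": False, "similarity": 0.65},
    {"id": "c4", "true_match": False, "similarity": 0.3},
]
report = bake_off(make_version(0.6), make_version(0.8), cases)
print(report["challenger"])  # {'missed_true_matches': ['c2'], 'false_positives': []}
print(report["blocked"])     # True
```

The challenger removes the champion's one false positive (`c3`) but misses true match `c2`, so it is blocked regardless of the improvement elsewhere.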

In numbers

Agents under regression eval: 100% of fleet
Eval runs / day: 3,400
Regressions caught pre-prod: 100%

Handoffs
