◆ Supervised Orchestrator
Reproduces the model developer's results, runs the benchmarking and outcomes analysis, probes the assumptions, and drafts the validation report to SR 11-7 structure. Spawns parallel test sub-agents for replication, benchmarking and sensitivity; an independent validation oversight agent owns the challenge and the sign-off.
Memory
Working: The model under validation, its documentation, and test results so far.
Episodic: Prior validations of similar models and recurring findings.
Semantic: SR 11-7 expectations, validation methodology, the model inventory.
Procedural: Validation-test playbooks per model class.
Store: File-based memory tool + model-documentation store
Orchestration
orchestrator-worker · MCP · A2A
Harness · Managed Agents … orchestrator spawning parallel test sub-agents (replication, benchmarking, sensitivity); sandboxed code execution; fresh context per sub-agent.
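The orchestrator-worker pattern described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual API: `SubAgentContext`, `run_sub_agent` and `orchestrate` are hypothetical names, the worker body is a placeholder for a sandboxed test run, and a thread pool stands in for whatever the managed-agent runtime uses to run sub-agents in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class SubAgentContext:
    """Fresh, isolated context handed to one test sub-agent (hypothetical shape)."""
    model_id: str
    task: str
    notes: list = field(default_factory=list)  # new list per context, never shared


def run_sub_agent(ctx: SubAgentContext) -> dict:
    # Placeholder: a real worker would run the validation test harness
    # inside a sandbox here.
    ctx.notes.append(f"ran {ctx.task}")
    return {"task": ctx.task, "model_id": ctx.model_id, "status": "done"}


def orchestrate(model_id: str, tasks: list) -> dict:
    # One new context per task, so no sub-agent sees another's working state.
    contexts = [SubAgentContext(model_id, task) for task in tasks]
    with ThreadPoolExecutor(max_workers=max(1, len(contexts))) as pool:
        results = list(pool.map(run_sub_agent, contexts))
    return {r["task"]: r for r in results}


results = orchestrate(
    "retail-pd-scorecard", ["replication", "benchmarking", "sensitivity"]
)
```

Keying the results by task name lets the orchestrator merge the three test streams into one report draft without depending on completion order.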
Tools
- ⌕ Model inventory + documentation · Retrieval
- ›_ Validation test harness · Code exec
- { } Model + data environment · API
- ⇄ Independent validation oversight agent · A2A
Evals & guardrails
- Every validation requires independent validation-oversight-agent challenge and sign-off (SR 11-7).
- Test reproducibility checked; results traced to the model environment.
- Agent-as-judge review of report completeness vs. the validation standard.
- Independence guardrail: cannot validate a model it helped develop.
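Two of the guardrails above, the independence check and the mandatory sign-off, lend themselves to a short sketch. The record shape and function names here (`ValidationRequest`, `check_independence`, `can_release`) are assumptions made for illustration; the source does not specify how these checks are implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidationRequest:
    model_id: str
    validator_id: str
    developer_ids: frozenset  # everyone (human or agent) who helped build the model


class IndependenceError(Exception):
    """Raised when the validator also appears among the model's developers."""


def check_independence(req: ValidationRequest) -> None:
    # Independence guardrail: refuse to validate a model the validator helped develop.
    if req.validator_id in req.developer_ids:
        raise IndependenceError(
            f"{req.validator_id} helped develop {req.model_id}; reassign validation"
        )


def can_release(signed_off_by: Optional[str], validator_id: str) -> bool:
    # Sign-off guardrail: a report is released only with a sign-off
    # that comes from someone other than the validator itself.
    return signed_off_by is not None and signed_off_by != validator_id


ok_request = ValidationRequest("retail-pd", "validator-agent", frozenset({"dev-team-a"}))
bad_request = ValidationRequest("retail-pd", "validator-agent", frozenset({"validator-agent"}))

check_independence(ok_request)  # passes silently
try:
    check_independence(bad_request)
    blocked = False
except IndependenceError:
    blocked = True
```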
Offline reflection
Offline self-reflection over closed validations refines which tests surface model weaknesses for each model class … sharpening the validation playbook.
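One way to picture the reflection step is as a gated playbook update: a proposed change to a model class's test list is kept only if it would have surfaced more of the findings from closed validations. The sketch below assumes a simple record shape (`findings`, `surfaced_by`) and made-up test names; the scoring rule is an illustrative choice, not the method the source describes.

```python
def detection_rate(playbook_tests: set, closed_validations: list) -> float:
    """Share of past findings that at least one playbook test would have surfaced."""
    findings = [f for v in closed_validations for f in v["findings"]]
    if not findings:
        return 0.0
    hits = sum(1 for f in findings if f["surfaced_by"] in playbook_tests)
    return hits / len(findings)


def gated_update(current: set, proposed: set, closed_validations: list) -> set:
    # Accept the proposal only if it scores strictly higher on past cases.
    if detection_rate(proposed, closed_validations) > detection_rate(
        current, closed_validations
    ):
        return proposed
    return current


# Toy history, invented for this example.
closed = [
    {"findings": [{"surfaced_by": "psi_drift"}, {"surfaced_by": "segment_backtest"}]},
    {"findings": [{"surfaced_by": "segment_backtest"}]},
]
current_playbook = {"psi_drift", "gini_check"}
proposed_playbook = {"psi_drift", "gini_check", "segment_backtest"}

new_playbook = gated_update(current_playbook, proposed_playbook, closed)
```

A production gate would presumably score proposals on validations held out from the reflection step, so the playbook is not tuned to the same cases it is judged on.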
Frontier edge
- ▲ Long-horizon autonomy: orchestrates a checkpointed, multi-day validation across parallel replication, benchmarking and sensitivity sub-agents, surviving model-environment stalls.
- ▲ Eval-gated continual learning: each closed validation feeds a SEAL-style self-edit to the per-class test playbook, so the next validation probes the weaknesses that bit last time.
- ▲ Reads model documentation, derivations and developer notebooks natively (multimodal), checking the maths on the page against the code it reproduces.
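The checkpointed, stall-tolerant run in the first point can be sketched as a stage loop that persists progress after every stage and skips completed stages on restart. The stage names, the JSON file layout and the `run_stage` callback are assumptions for illustration; the simulated `TimeoutError` stands in for a model-environment stall.

```python
import json
import tempfile
from pathlib import Path

STAGES = ["replication", "benchmarking", "sensitivity", "report_draft"]


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"completed": {}}


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def run_validation(path: Path, run_stage) -> dict:
    state = load_checkpoint(path)
    for stage in STAGES:
        if stage in state["completed"]:
            continue  # finished in an earlier attempt
        state["completed"][stage] = run_stage(stage)
        save_checkpoint(path, state)
    return state


calls = []


def flaky_stage(stage: str) -> str:
    # Simulates one stall on the first sensitivity attempt.
    calls.append(stage)
    if stage == "sensitivity" and calls.count("sensitivity") == 1:
        raise TimeoutError("model environment stalled")
    return "ok"


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "validation.json"
    try:
        run_validation(checkpoint, flaky_stage)
    except TimeoutError:
        pass  # the run stops here; progress so far is already on disk
    final_state = run_validation(checkpoint, flaky_stage)
```

On the second attempt only the stalled stage and the stages after it run again; replication and benchmarking are read back from the checkpoint.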
A sample run
Trigger: Annual revalidation due on the retail PD scorecard.
- 1. Spawn sub-agents: replicate the developer's results, benchmark, run sensitivity tests.
- 2. Probe assumptions and check outcomes analysis against realised defaults.
- 3. Identify findings and rate their severity.
- 4. Draft the validation report to SR 11-7 structure with cited evidence.
Output: A draft validation report with two medium findings (a data-quality gap and a stale segment), routed to the independent validation oversight agent for challenge and sign-off.
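Steps 2 and 3 of the run above can be illustrated with a small outcomes-analysis sketch: compare the mean predicted PD with the realised default rate per segment, then rate the gap. The segment names, the toy records and the severity thresholds (`medium_gap`, `high_gap`) are all invented for this example; SR 11-7 does not prescribe numeric cut-offs, and a real validation would use the bank's own rating scale.

```python
def outcomes_by_segment(records) -> dict:
    """records: iterable of (segment, predicted_pd, defaulted) tuples."""
    totals = {}
    for segment, predicted_pd, defaulted in records:
        n, pd_sum, defaults = totals.get(segment, (0, 0.0, 0))
        totals[segment] = (n + 1, pd_sum + predicted_pd, defaults + int(defaulted))
    return {
        segment: {"n": n, "mean_pd": pd_sum / n, "observed_dr": defaults / n}
        for segment, (n, pd_sum, defaults) in totals.items()
    }


def rate_severity(mean_pd: float, observed_dr: float,
                  medium_gap: float = 0.02, high_gap: float = 0.05) -> str:
    # Illustrative thresholds on the absolute calibration gap.
    gap = abs(observed_dr - mean_pd)
    if gap >= high_gap:
        return "high"
    if gap >= medium_gap:
        return "medium"
    return "low"


# Toy data: 10 accounts per segment, one realised default in near_prime.
records = [("prime", 0.01, False)] * 10
records += [("near_prime", 0.07, i == 0) for i in range(10)]

stats = outcomes_by_segment(records)
ratings = {
    segment: rate_severity(s["mean_pd"], s["observed_dr"])
    for segment, s in stats.items()
}
```

In this toy set the prime segment is well calibrated, while near_prime defaults at 10% against a predicted 7%, which the assumed thresholds rate as a medium finding.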
In numbers
Median validation turnaround: ~6 weeks
Models validated / yr: ~900, whole inventory on cycle
Handoffs
Hands to → Model Performance Monitor
Across ⇢ Every model-owning desk in the bank (incl. the AI agents here) ⇢ Financial Crime … Scenario Tuning for tuning-model sign-off