◆ Assistive Orchestrator
An offline batch job, not a live actor. It replays the day's agent trajectories (in the style of Reflexion and SEAL), distils repeated corrections into procedural-memory updates, consolidates episodic logs into semantic facts, and drafts candidate prompt and playbook improvements. Every proposal must pass Crucible's evals and an independent oversight-agent gate before it ships.
Memory
Working: The trajectory batch under replay and the lessons being distilled.
Episodic: The full corpus of fleet runs and their outcomes.
Semantic: Consolidated cross-agent lessons and shared knowledge.
Procedural: The candidate playbook/prompt deltas it proposes.
Store: Trace warehouse (read) + proposal store; vector + knowledge-graph hybrid
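The step from episodic records to semantic facts can be illustrated with a minimal sketch. It assumes, purely for illustration, that each episodic record reduces to an (agent, lesson) pair and that a lesson becomes a semantic fact once at least `min_agents` distinct agents have recorded it. The function name, the threshold, and the sample lessons are hypothetical, not the agent's actual consolidation rule.

```python
from collections import defaultdict


def consolidate(episodes: list[tuple[str, str]], min_agents: int = 2) -> list[str]:
    """Return lessons recorded by at least `min_agents` distinct agents.

    `episodes` holds hypothetical (agent_id, lesson) pairs drawn from the
    episodic store; the threshold is an illustrative assumption.
    """
    agents_by_lesson: dict[str, set[str]] = defaultdict(set)
    for agent_id, lesson in episodes:
        agents_by_lesson[lesson].add(agent_id)
    # Keep only lessons that recur across agents, sorted for a stable result.
    return sorted(
        lesson
        for lesson, agents in agents_by_lesson.items()
        if len(agents) >= min_agents
    )


# Made-up episodic records for illustration only.
episodes = [
    ("agent-a", "cite the filing date in the narrative"),
    ("agent-b", "cite the filing date in the narrative"),
    ("agent-a", "retry on warehouse timeout"),
]
semantic_facts = consolidate(episodes)
```

Here only the lesson seen by two different agents is kept; the one-off lesson stays in episodic memory.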
Orchestration
pipeline · MCP · A2A
Harness · Managed Agents: scheduled offline batch; orchestrator over replay workers; structured note-taking; no production write access, proposals only.
Tools
- ⌕ Trace / trajectory warehouse · Retrieval
- ›_ Replay + analysis sandbox · Code exec
- ⇄ Eval harness agent · A2A
- ⇄ Oversight-agent promotion gate · A2A
Evals & guardrails
- Hard rule: proposals only, zero production write access. Every change is re-evaluated by Crucible.
- Improvements must beat the champion on gold sets before the oversight-agent gate will promote them.
- Anchored to Reflexion/SEAL experience-replay; explicitly an offline consolidation job.
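The guardrails above amount to a two-stage gate, which might be sketched as follows. This is a sketch under stated assumptions: `EvalResult`, `beats_champion`, and `route_proposal` are hypothetical names, and the rule that the candidate must win on every gold set (rather than on an aggregate) is an assumption, since the card does not say how gold-set results are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    gold_set: str
    champion_score: float
    candidate_score: float


def beats_champion(results: list[EvalResult], min_margin: float = 0.0) -> bool:
    """True only if the candidate beats the champion on every gold set.

    An empty result list never passes: no evidence means no promotion.
    """
    return bool(results) and all(
        r.candidate_score - r.champion_score > min_margin for r in results
    )


def route_proposal(results: list[EvalResult], oversight_approved: bool) -> str:
    """Return a proposal status; this function never writes to production."""
    if not beats_champion(results):
        return "rejected"
    if not oversight_approved:
        return "awaiting_oversight"
    return "promotable"


# Illustrative scores only.
passing = [EvalResult("gold-a", champion_score=0.80, candidate_score=0.86)]
failing = [EvalResult("gold-a", champion_score=0.80, candidate_score=0.79)]
```

The function only returns a status. Promotion itself stays with the independent oversight-agent gate, which matches the proposals-only rule.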
Offline reflection
This agent is the reflection layer for the fleet: offline memory consolidation and self-reflection over trajectories, with evals and an oversight-agent gate on every output.
Frontier edge
- ▲ Self-improving fleet: distils repeated corrections across agents into shared procedural-memory deltas (Reflexion/SEAL-style), so a lesson one agent learns lifts the whole population; every delta eval-gated and oversight-agent-approved, never auto-shipped.
- ▲ Continual learning, offline-consolidated: turns the day's episodic trajectories into semantic facts and candidate playbook edits, the consolidation pass behind eval-gated self-improvement.
- ▲ Causal reasoning: clusters oversight-agent overrides by their underlying cause, not surface text, so the proposed fix addresses why the agent erred rather than patching a symptom.
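Clustering by cause rather than by surface text can be illustrated with a small sketch. It assumes each override already carries a `root_cause` label; how that label is inferred is not described in this card, so the field, the `Override` type, and the sample records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Override:
    agent: str
    correction_text: str  # what the oversight agent wrote (surface text)
    root_cause: str       # hypothetical label from an upstream analysis step


def cluster_by_cause(overrides: list[Override]) -> dict[str, list[Override]]:
    """Group overrides by root cause, ignoring differences in wording."""
    clusters: dict[str, list[Override]] = defaultdict(list)
    for override in overrides:
        clusters[override.root_cause].append(override)
    return dict(clusters)


# Made-up overrides: two differently worded corrections share one cause.
overrides = [
    Override("agent-a", "Add the transaction dates.", "missing_timeline"),
    Override("agent-b", "Narrative lacks a chronology.", "missing_timeline"),
    Override("agent-c", "Wrong currency code.", "stale_reference_data"),
]
clusters = cluster_by_cause(overrides)
```

Grouping on the text alone would yield three singleton clusters here; grouping on the cause yields one recurring cluster of two and one singleton, so a single fix can target the shared cause.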
A sample run
Trigger: Nightly batch over the day's ~40M agent trajectories.
- 1. Cluster recurring oversight-agent overrides across agents (e.g. a repeated SAR-narrative correction).
- 2. Distil each cluster into a candidate procedural-memory or prompt update.
- 3. Replay the candidate against historical cases; route to Crucible for gold-set eval.
- 4. File passing proposals to the oversight-agent promotion gate with their eval scores.
Output: A ranked queue of eval-passing improvement proposals for the oversight-agent gate, e.g. a SAR-narrative playbook tweak scoring 6 points higher on the QA judge. Nothing ships without an independent gate.
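The four steps of the sample run can be sketched end to end as below. This is a toy outline, not the production pipeline: distillation, replay, and the Crucible eval are collapsed into one injected `evaluate` callable, and `Proposal`, `nightly_run`, `min_support`, and the cause labels are hypothetical. The 6-point gain mirrors the example in the output line; the other numbers are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    cause: str
    support: int   # number of overrides in the cluster
    gain: float    # candidate score minus champion score on the gold set


def nightly_run(
    override_causes: list[str],
    evaluate: Callable[[str], float],
    min_support: int = 2,
) -> list[Proposal]:
    """Return eval-passing proposals, ranked by gain (highest first)."""
    clusters = Counter(override_causes)      # step 1: cluster overrides
    queue: list[Proposal] = []
    for cause, support in clusters.items():
        if support < min_support:
            continue                         # one-offs are not "recurring"
        gain = evaluate(cause)               # steps 2-3: distil, replay, eval (stubbed)
        if gain > 0:
            queue.append(Proposal(cause, support, gain))
    # Step 4: file a ranked queue for the oversight-agent gate.
    return sorted(queue, key=lambda p: p.gain, reverse=True)


# Stubbed eval gains, for illustration only.
gains = {"sar_narrative_format": 6.0, "stale_reference_data": -1.0}
causes = ["sar_narrative_format"] * 3 + ["stale_reference_data"] * 2 + ["one_off"]
queue = nightly_run(causes, evaluate=lambda cause: gains.get(cause, 0.0))
```

Only the SAR-narrative cluster reaches the queue: the second cluster recurs but scores below the champion, and the one-off never forms a cluster. The function returns a queue and nothing else, in keeping with the proposals-only rule.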
In numbers
~40M · Trajectories replayed / night
60 · Improvement proposals / week
~every 7 months · Time-horizon doubling (METR-style)
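Reading the last figure as a doubling time, the implied growth can be worked out with a one-line formula: after t months the horizon is multiplied by 2^(t / 7). The sketch below assumes clean exponential growth at the approximate 7-month rate quoted above; the function name is hypothetical and the figure itself is approximate.

```python
def horizon_multiplier(months: float, doubling_months: float = 7.0) -> float:
    """Growth factor after `months`, assuming a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


one_period = horizon_multiplier(7)     # 2.0
two_periods = horizon_multiplier(14)   # 4.0
three_periods = horizon_multiplier(21) # 8.0
```

Under this assumption, 21 months of doubling every 7 months gives an eightfold increase.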
Handoffs
Fed by ← Fleet Observability Agent
Hands to → Eval Harness Agent
Across ⇢ All divisions → fleet-wide procedural-memory improvements