◆ Supervised Specialist
Reads every PR with the project's context in working memory (the diff, the surrounding code, the conventions, the security policy) and posts a precise, low-noise review. It reproduces suspect edge cases in a sandbox to confirm a bug is real before blocking; on security-sensitive paths a second judge agent re-derives the verdict before the merge commits.
Memory
Working The diff, the touched files, the project conventions and the review so far.
Episodic Prior reviews on this repo and the mistakes the team makes repeatedly.
Semantic Language idioms, the bank's secure-coding standards, the style guide.
Procedural Review playbooks refined from which comments the judge agent upheld vs. overturned.
Store Repo-context retrieval + review-history store
Orchestration
router-fanout MCP A2A
Harness · Managed Agents: session per PR; sandboxed code-exec to run tests and reproduce; context editing trims read files once reasoned over.
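As an illustration of the context-editing step, here is a minimal sketch that swaps the body of each file already reasoned over for a short stub. The `ContextItem` record, its `reasoned_over` flag and the `trim_context` helper are hypothetical names introduced for this example; they are not the harness's actual API.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    """Hypothetical record for one file held in the session context."""
    path: str
    content: str
    reasoned_over: bool = False


def trim_context(items):
    """Return a new list where files already reasoned over are reduced to a stub."""
    trimmed = []
    for item in items:
        if item.reasoned_over:
            # Keep the path and size so the agent knows what was dropped.
            stub = f"[trimmed: {item.path}, {len(item.content)} chars]"
            trimmed.append(ContextItem(item.path, stub, True))
        else:
            trimmed.append(item)
    return trimmed
```

Leaving a stub in place of the file, instead of deleting the entry, lets a later step re-read the file by path if it turns out to matter.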
Tools
- Git / PR platform (API)
- Test + reproduction sandbox (Code exec)
- Static analysis / SAST (API)
- Secure-coding standards (Retrieval)
Evals & guardrails
- Comment-acceptance rate tracked; low-signal nitpicking is penalized, not just missed bugs.
- Security findings cross-checked by an agent-as-judge before they block a merge.
- Cannot approve-and-merge alone on security-sensitive paths; a second judge agent must re-derive the verdict.
- All review runs traced and fed to AgentOps for drift detection.
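The two-verdict rule for security-sensitive paths can be sketched as a small gate function. The path patterns, the verdict strings and the function names below are assumptions made for illustration; a real deployment would presumably load the sensitive-path list from the security policy.

```python
from fnmatch import fnmatch

# Hypothetical patterns marking security-sensitive code.
SECURITY_SENSITIVE = ["services/payments/*", "auth/*"]


def is_security_sensitive(paths):
    """True if any touched path matches a sensitive pattern."""
    return any(fnmatch(p, pat) for p in paths for pat in SECURITY_SENSITIVE)


def may_merge(paths, reviewer_verdict, judge_verdict=None):
    """Decide whether a PR may merge given the reviewer and judge verdicts."""
    if reviewer_verdict != "approve":
        return False
    if is_security_sensitive(paths):
        # The reviewer cannot approve-and-merge alone on these paths:
        # the judge agent must independently reach "approve" as well.
        return judge_verdict == "approve"
    return True
```

For example, `may_merge(["services/payments/authorize.py"], "approve")` is `False` until a judge verdict of `"approve"` is supplied, while a docs-only PR merges on the reviewer's approval alone.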
Offline reflection
Offline replay of which review comments the judge agent upheld vs. overturned, refining the review playbook to cut noise. Consolidation job, not live learning.
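One way to picture the replay job: tally, per playbook rule, how often the judge agent upheld the comments that rule produced, then propose demoting the rules that are mostly overturned. This is a sketch under assumptions. The `(rule, upheld)` log format, the rule names and both thresholds are hypothetical, and the output is a set of proposals for an eval gate to accept or reject, not a live change.

```python
from collections import defaultdict


def propose_demotions(judged_comments, min_samples=20, min_uphold_rate=0.5):
    """Return {rule: uphold_rate} for rules the judge mostly overturned.

    judged_comments: iterable of (rule_id, upheld) pairs from past reviews.
    """
    counts = defaultdict(lambda: [0, 0])  # rule -> [upheld, total]
    for rule, upheld in judged_comments:
        counts[rule][1] += 1
        if upheld:
            counts[rule][0] += 1

    proposals = {}
    for rule, (upheld, total) in counts.items():
        if total < min_samples:
            continue  # too little evidence to judge this rule
        rate = upheld / total
        if rate < min_uphold_rate:
            proposals[rule] = rate
    return proposals
```

The minimum-sample check keeps a rarely fired rule from being demoted on a handful of overturned comments.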
Frontier edge
- ▲ Causal reasoning: traces how a diff could break behaviour downstream (a missing idempotency guard double-charging on retry), not just pattern-matching style nits.
- ▲ Continual learning: eval-gated self-edits to the playbook, driven by which comments the judge agent upheld vs. overturned, cut the noise the fleet ignores.
- ▲ World-model simulation: reproduces the suspect edge case in a sandbox to confirm the bug is real before it ever blocks a merge.
A sample run
Trigger A PR touches the payment authorization service.
1. Pull the diff and surrounding code; load the secure-coding standard into context.
2. Run the test suite in the sandbox; reproduce a failing edge case the author missed.
3. Spot a missing idempotency guard that could double-charge on retry.
4. Post a precise review with the failing test and a suggested fix.
Output A blocking review on the idempotency bug with a reproduction and patch; routine style items batched as non-blocking. A second judge agent re-derives the verdict before the merge commits.
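The idempotency bug in this run can be reproduced with a toy model. The sketch below is not the payment authorization service's code: `PaymentService`, its `authorize` method, the `"order-42"` key and the amounts are all hypothetical, chosen only to show why a retried request double-charges when the guard is missing.

```python
class PaymentService:
    """Toy in-memory stand-in for a payment authorization service."""

    def __init__(self, use_idempotency_guard):
        self.use_idempotency_guard = use_idempotency_guard
        self.charges = []   # every charge actually applied: (charge_id, amount_cents)
        self._seen = {}     # idempotency key -> charge id already issued

    def authorize(self, idempotency_key, amount_cents):
        # With the guard on, a repeated key returns the original charge
        # instead of creating a new one.
        if self.use_idempotency_guard and idempotency_key in self._seen:
            return self._seen[idempotency_key]
        charge_id = len(self.charges) + 1
        self.charges.append((charge_id, amount_cents))
        self._seen[idempotency_key] = charge_id
        return charge_id


def total_charged_after_retry(service):
    """Simulate a client that times out and resends the same request."""
    service.authorize("order-42", 1999)
    service.authorize("order-42", 1999)  # retry with the same key
    return sum(amount for _, amount in service.charges)
```

Without the guard the customer is charged 3998 cents for a single order; with it the retry is absorbed and the total stays at 1999. A failing test of this shape is the kind of reproduction the blocking review would attach.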
In numbers
1,800 PRs reviewed / day
4 min Median time-to-first-review
81% Comment-acceptance rate
Handoffs
Hands to → Eval Harness Agent
Across ⇢ Cybersecurity → SOC for confirmed code-level vulnerabilities