Code Review Agent

◆ Supervised Specialist

Reads every PR with the project's context in working memory (the diff, the surrounding code, the conventions, the security policy) and posts a precise, low-noise review. It reproduces suspect edge cases in a sandbox to confirm a bug is real before blocking; on security-sensitive paths a second judge agent re-derives the verdict before the merge commits.

Memory

Working: The diff, the touched files, the project conventions and the review so far.

Episodic: Prior reviews on this repo and recurring mistakes the team makes.

Semantic: Language idioms, the bank's secure-coding standards, the style guide.

Procedural: Review playbooks refined from which comments the judge agent upheld vs. overturned.

Store: Repo-context retrieval + review-history store

Orchestration

router-fanout · MCP · A2A
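
One possible reading of the router fan-out pattern, sketched under assumptions: a router inspects the changed paths, selects specialist reviewers, runs them concurrently and merges their findings. The reviewer functions and `SENSITIVE_PREFIXES` are hypothetical stand-ins, not the agent's actual components, and the MCP and A2A transports are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical reviewers: each takes the changed paths and returns findings.
Reviewer = Callable[[list[str]], list[str]]

def style_reviewer(paths: list[str]) -> list[str]:
    return [f"style: checked {p}" for p in paths]

def security_reviewer(paths: list[str]) -> list[str]:
    return [f"security: checked {p}" for p in paths]

# Assumed convention: path prefixes treated as security-sensitive.
SENSITIVE_PREFIXES = ("payments/", "auth/")

def route(paths: list[str]) -> list[Reviewer]:
    """Select reviewers; the security reviewer joins only for sensitive paths."""
    reviewers: list[Reviewer] = [style_reviewer]
    if any(p.startswith(SENSITIVE_PREFIXES) for p in paths):
        reviewers.append(security_reviewer)
    return reviewers

def fan_out(paths: list[str]) -> list[str]:
    """Run the selected reviewers concurrently and merge findings in router order."""
    reviewers = route(paths)
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        results = pool.map(lambda r: r(paths), reviewers)
        return [finding for findings in results for finding in findings]
```

Calling `fan_out(["payments/charge.py"])` runs both reviewers, while a docs-only change reaches the style reviewer alone.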

Harness · Managed Agents: session per PR; sandboxed code-exec to run tests and reproduce; context editing trims read files once reasoned over.
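
A minimal sketch of what "context editing trims read files" could look like, assuming the session keeps one entry per file and the agent writes a short summary once it has reasoned over that file. `PRSession`, `ContextEntry` and the `[trimmed]` marker are illustrative names, not the harness's real API.

```python
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    path: str
    content: str
    summary: str = ""            # note written after reasoning over the file
    reasoned_over: bool = False

@dataclass
class PRSession:
    """One session per PR, holding the working context for that review."""
    pr_id: str
    entries: list[ContextEntry] = field(default_factory=list)

    def size(self) -> int:
        """Total characters currently held in context."""
        return sum(len(e.content) for e in self.entries)

    def trim(self) -> None:
        """Replace fully-read file bodies with their summaries to free context."""
        for e in self.entries:
            if e.reasoned_over and e.summary:
                e.content = f"[trimmed] {e.summary}"
```

Files that have not been reasoned over, or that lack a summary, are left intact so no unread material is lost.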

Tools

Git / PR platform (API)
Test + reproduction sandbox (Code exec)
Static analysis / SAST (API)
Secure-coding standards (Retrieval)

Evals & guardrails

Comment-acceptance rate tracked; low-signal nitpicking is penalized, not just missed bugs.
Security findings cross-checked by an agent-as-judge before they block a merge.
Cannot approve-and-merge alone on security-sensitive paths; a second judge agent must re-derive the verdict.
All review runs traced and fed to AgentOps for drift detection.
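
The merge guardrail and the acceptance metric above can be sketched as follows. This is an illustration under assumptions: the `Verdict` shape and the reviewer labels are hypothetical, and "accepted" is taken to mean the author acted on the comment, which the card does not define.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    approve: bool
    reviewer: str  # "primary" or "judge" (labels assumed for this sketch)

def can_merge(security_sensitive: bool, primary: Verdict,
              judge: Optional[Verdict]) -> bool:
    """Primary approval suffices on ordinary paths; security-sensitive paths
    also need an independently derived approval from the judge agent."""
    if not primary.approve:
        return False
    if not security_sensitive:
        return True
    return judge is not None and judge.approve

def acceptance_rate(accepted: int, posted: int) -> float:
    """Share of posted comments that were accepted; 0.0 when none were posted."""
    return accepted / posted if posted else 0.0
```

A missing judge verdict on a sensitive path is treated as a block rather than a pass, so the gate fails closed.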

Offline reflection

Offline replay of which review comments the judge agent upheld vs. overturned, refining the review playbook to cut noise. Consolidation job, not live learning.
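
As a rough sketch of such a consolidation job, the replay could tally how often the judge overturned comments from each playbook rule and flag the noisy ones for demotion. The rule names, the 0.5 threshold and the minimum sample count are assumptions chosen for illustration.

```python
from collections import Counter

# Each history item: (playbook_rule, upheld_by_judge)
History = list[tuple[str, bool]]

def overturn_rates(history: History) -> dict[str, float]:
    """Fraction of each rule's comments that the judge overturned."""
    totals = Counter(rule for rule, _ in history)
    overturned = Counter(rule for rule, upheld in history if not upheld)
    return {rule: overturned[rule] / totals[rule] for rule in totals}

def rules_to_demote(history: History, max_overturn: float = 0.5,
                    min_samples: int = 3) -> list[str]:
    """Rules overturned more often than the threshold, given enough samples."""
    totals = Counter(rule for rule, _ in history)
    rates = overturn_rates(history)
    return sorted(
        rule for rule, rate in rates.items()
        if totals[rule] >= min_samples and rate > max_overturn
    )
```

Because the job runs over stored history in batch, any playbook change it proposes can be checked against evals before it reaches live reviews.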

Frontier edge

▲ Causal reasoning: traces how a diff could break behaviour downstream (a missing idempotency guard double-charging on retry), not just pattern-matching style nits.
▲ Continual learning: eval-gated self-edits to the playbook, driven by which comments the judge agent upheld vs. overturned, cut the noise that the fleet ignores.
▲ World-model simulation: reproduces the suspect edge case in a sandbox to confirm the bug is real before it ever blocks a merge.

A sample run

Trigger: A PR touches the payment authorization service.

1. Pull the diff and surrounding code; load the secure-coding standard into context.
2. Run the test suite in the sandbox; reproduce a failing edge case the author missed.
3. Spot a missing idempotency guard that could double-charge on retry.
4. Post a precise review with the failing test and a suggested fix.

Output: A blocking review on the idempotency bug with a reproduction and patch; routine style items batched as non-blocking. A second judge agent re-derives the verdict before the merge commits.
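
To make the bug class in this run concrete, here is a toy reproduction of a retry double-charging, together with one common form of guard. `PaymentService`, `idempotency_key` and the in-memory store are hypothetical; the real service and the agent's suggested patch are not shown in the source.

```python
class PaymentService:
    """Toy in-memory authorizer; `idempotency_key` is an assumed request field."""

    def __init__(self) -> None:
        self.charges: list[tuple[str, int]] = []  # (account, amount_cents)
        self._seen: dict[str, int] = {}           # idempotency_key -> charge index

    def charge_unguarded(self, account: str, amount_cents: int) -> int:
        # Bug pattern: every call records a new charge, so a retry double-charges.
        self.charges.append((account, amount_cents))
        return len(self.charges) - 1

    def charge(self, account: str, amount_cents: int, idempotency_key: str) -> int:
        # Guard: a retry carrying the same key returns the original charge.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.charges.append((account, amount_cents))
        self._seen[idempotency_key] = len(self.charges) - 1
        return self._seen[idempotency_key]

def reproduce_retry(charge_once, attempts: int = 2) -> None:
    """Simulate a client that re-sends the same request after a timeout."""
    for _ in range(attempts):
        charge_once()
```

Running `reproduce_retry` against `charge_unguarded` records two charges for one intended payment, while the guarded `charge` records one. A production guard would also need durable key storage and protection against concurrent retries.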

In numbers

1,800 PRs reviewed / day

4 min median time-to-first-review

81% comment-acceptance rate

Handoffs

Fed by ← Change Implementation Agent

Hands to → Eval Harness Agent

Across ⇢ Cybersecurity → SOC for confirmed code-level vulnerabilities