The Agentic Bank
Reference

The patterns behind the agents

Every agent in this bank is configured with the same vocabulary. Here's what those words actually mean today: the deployed, the emerging, and the honestly-still-research, with primary sources so you can check the homework.

Memory

Agents that only live inside a context window forget everything between runs. Real systems give them tiers.

Practitioners borrow from cognitive science: working memory (what's in the context window right now), and long-term memory split into episodic (past interactions), semantic (consolidated facts and policy) and procedural (learned playbooks).

In production this looks like Anthropic's file-based memory tool paired with context editing and compaction to fight "context rot"; Letta's OS-style tiered memory (core / recall / archival); and mem0-style hybrids that combine a vector store with a knowledge graph so an agent can do multi-hop reasoning over what it knows.
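To make the tiers concrete, here is a minimal sketch in Python. The class, field names and the keyword-based recall are all hypothetical and chosen for illustration; this is not the API of Anthropic's memory tool, Letta or mem0, and a real system would typically use embeddings or a graph query rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical tiered memory; names are illustrative, not a vendor API."""
    working: list[str] = field(default_factory=list)                 # what is in the context window now
    episodic: list[str] = field(default_factory=list)                # past interactions
    semantic: dict[str, str] = field(default_factory=dict)           # consolidated facts and policy
    procedural: dict[str, list[str]] = field(default_factory=dict)   # learned playbooks
    working_limit: int = 3                                           # assumed toy context budget

    def observe(self, event: str) -> None:
        """Add an event to working memory, spilling the oldest into episodic."""
        self.working.append(event)
        while len(self.working) > self.working_limit:
            self.episodic.append(self.working.pop(0))

    def recall(self, keyword: str) -> list[str]:
        """Naive keyword search over episodic memory."""
        return [e for e in self.episodic if keyword.lower() in e.lower()]


memory = AgentMemory()
memory.semantic["wire_limit"] = "Wires above the daily limit need a second approver"
memory.procedural["dispute"] = ["verify identity", "freeze card", "open case"]
for event in ["customer asked about fees", "customer reported lost card",
              "card frozen", "case opened", "customer thanked agent"]:
    memory.observe(event)

print(memory.working)         # the three most recent events
print(memory.recall("card"))  # older events retrieved from episodic memory
```

The point of the sketch is the spill: once working memory exceeds its budget, older events move to a tier that survives outside the context window and can be retrieved later.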

Harnesses & the agent loop

The model is the brain. The harness is everything that turns it into an agent that can actually do work.

An agent harness wraps the model with tool execution, context and memory management, state persistence, error handling and guardrails. The loop itself is small: gather context → take action → verify results, sometimes named ReAct or OODA. The capability and the safety live in the harness.

Anthropic's Managed Agents is a clean reference: it separates the session (an append-only event log that lives outside the context window and supports rewind/resume), the harness (the loop calling the model and routing tool calls), and the sandbox (where tools actually execute). Decouple the brain from the hands.
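The separation of session, harness and sandbox can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Session` class, the event shape, the `run_harness` function and the scripted stand-in for the model are hypothetical, and none of it is the Managed Agents API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    """Append-only event log kept outside the context window (hypothetical shape)."""
    events: list[dict] = field(default_factory=list)

    def append(self, kind: str, payload: object) -> None:
        self.events.append({"seq": len(self.events), "kind": kind, "payload": payload})

    def rewind(self, seq: int) -> "Session":
        """Return a new session holding events up to and including `seq`."""
        return Session(events=list(self.events[: seq + 1]))


def run_harness(model: Callable[[list[dict]], dict],
                tools: dict[str, Callable[..., object]],
                session: Session, max_steps: int = 5) -> object:
    """Gather context, take action, record the result, until the model answers."""
    for _ in range(max_steps):
        decision = model(session.events)            # gather context from the log
        session.append("decision", decision)
        if decision["type"] == "final":
            return decision["answer"]
        tool = tools.get(decision["tool"])
        if tool is None:                            # guardrail: unknown tool
            session.append("error", f"unknown tool {decision['tool']}")
            continue
        result = tool(**decision["args"])           # take action (the sandbox's job)
        session.append("tool_result", result)       # logged so the next turn can verify
    raise RuntimeError("step budget exhausted")


def scripted_model(events: list[dict]) -> dict:
    """Stand-in for a model call: ask for the balance once, then answer."""
    results = [e["payload"] for e in events if e["kind"] == "tool_result"]
    if not results:
        return {"type": "tool", "tool": "get_balance", "args": {"account": "ACC-1"}}
    return {"type": "final", "answer": f"Balance is {results[-1]}"}


session = Session()
session.append("user", "What is the balance on ACC-1?")
answer = run_harness(scripted_model, {"get_balance": lambda account: 1250}, session)
print(answer)
print([e["kind"] for e in session.events])
```

Because the log lives outside the loop, `session.rewind(1)` yields a session that can be resumed from just after the first decision without re-running anything before it.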

Orchestration & multi-agent

One agent rarely does a complex job alone. The dominant production pattern is orchestrator–worker.

A lead/orchestrator agent plans and spawns several specialised workers in parallel, each with a fresh context window and a scoped prompt, then merges their results. Other shapes: routers that triage and dispatch, pipelines, swarms, and agent-as-judge evaluators.
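A minimal sketch of the orchestrator–worker shape, using Python's standard `concurrent.futures` for the parallel fan-out. The roles, the subject string and the `worker` body are hypothetical placeholders; a real worker would call a model with its scoped prompt.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(task: dict) -> dict:
    """Stand-in for a specialised agent with its own fresh context."""
    context = [task["prompt"]]                      # fresh context: only the scoped prompt
    finding = f"{task['role']} reviewed {task['subject']}"
    return {"role": task["role"], "finding": finding, "context_size": len(context)}


def orchestrate(subject: str, roles: list[str]) -> dict:
    """Plan scoped tasks, fan them out in parallel, then merge the results."""
    tasks = [{"role": role, "subject": subject,
              "prompt": f"As the {role} specialist, assess {subject}."}
             for role in roles]
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        results = list(pool.map(worker, tasks))     # map preserves task order
    return {"subject": subject,
            "findings": {r["role"]: r["finding"] for r in results}}


report = orchestrate("loan application 42", ["credit", "fraud", "compliance"])
print(report["findings"])
```

Each worker sees only its own prompt, which is what keeps one worker's context from crowding out another's; the orchestrator alone holds the merged view.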

Agents coordinate over three open protocols: MCP (Model Context Protocol) connects an agent to tools; A2A (Agent2Agent) lets agents delegate to one another; and AP2 extends A2A for agent-initiated payments, which is directly relevant to a bank. MCP and A2A are now stewarded under the Linux Foundation.

Tool use & MCP

Agents act through tools: APIs, retrieval, code execution, and, when the system is legacy, the screen itself.

Beyond plain tool-calling, two recent moves matter. Computer use lets an agent drive a GUI like a person, which is how it reaches the legacy systems a bank can't replace overnight. And code execution with MCP has the model write code that calls MCP tools as APIs in a sandbox, loading tools on demand and filtering data before it ever hits the context window (a reported ~98% token reduction on large tool sets).
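The filtering idea can be illustrated with a toy Python example. The `list_transactions` wrapper and its data are hypothetical; in the pattern described above it would be a generated wrapper around an MCP tool, and the snippet would run in a sandbox. The sizes printed here come from made-up data and say nothing about the reported figure.

```python
import json


def list_transactions(account: str) -> list[dict]:
    """Hypothetical tool wrapper returning a large, verbose result set."""
    return [{"id": i, "account": account, "amount": 50 * i, "memo": "x" * 200}
            for i in range(1, 501)]


def agent_written_snippet() -> dict:
    """What the model might write: call the tool, filter in the sandbox,
    and hand back only a compact summary for the context window."""
    rows = list_transactions("ACC-1")
    large = [r for r in rows if r["amount"] >= 24_000]
    return {"count": len(large), "ids": [r["id"] for r in large][:5]}


raw_size = len(json.dumps(list_transactions("ACC-1")))   # what plain tool-calling would return
summary = agent_written_snippet()
summary_size = len(json.dumps(summary))                  # what actually reaches the model

print(summary)
print(raw_size, summary_size)
```

The full result set never enters the context window; only the summary does, which is where the token saving comes from.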

Evals, guardrails & observability

You cannot deploy an autonomous agent into a regulated bank without measuring it continuously.

Automated evaluation has moved from LLM-as-judge to agent-as-judge: an agent with tools and memory evaluating another agent's whole action chain, not just its final answer. Production systems add gold-set precision/recall, champion/challenger before any change goes live, real-time guardrails that block or escalate, drift detection, and immutable audit logs.
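Gold-set precision/recall and a champion/challenger gate are simple enough to show directly. This is a sketch under stated assumptions: the transaction IDs are invented, and the promotion rule (the challenger must be no worse on both metrics) is one possible policy, not a standard.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision and recall of flagged items against a labelled gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def challenger_may_ship(champion: tuple[float, float],
                        challenger: tuple[float, float]) -> bool:
    """Assumed gate: promote only if no worse on both precision and recall."""
    return challenger[0] >= champion[0] and challenger[1] >= champion[1]


gold = {"txn-1", "txn-4", "txn-7", "txn-9"}                 # hypothetical labelled cases
champion_flags = {"txn-1", "txn-4", "txn-5"}                # current production agent
challenger_flags = {"txn-1", "txn-4", "txn-7", "txn-8"}     # candidate change

champion = precision_recall(champion_flags, gold)
challenger = precision_recall(challenger_flags, gold)
print(champion, challenger, challenger_may_ship(champion, challenger))
```

Here the challenger finds three of the four gold cases with one false positive, so it beats the champion on both metrics and passes the gate.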

Observability is converging on OpenTelemetry's GenAI semantic conventions (span-level tracing of every agent, tool and model call), so banks avoid vendor lock-in and keep an exam-ready record.

Offline reflection

The honest version of "agents that dream": offline consolidation and self-reflection, not a magic switch.

Between runs, an agent can replay what happened and turn it into lessons. Reflexion stores verbal self-critique in an episodic buffer so the agent improves on retry without any weight update. SEAL (MIT, 2025) is the frontier signal: a model that generates its own "self-edits", is fine-tuned on them to update its own weights, and uses reinforcement learning to get better at writing those edits.

This is real research, and a credible near-term capability for consolidating a fleet's experience overnight. It is not a shipped product you flip on, and anyone who tells you otherwise is selling something.
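The Reflexion-style retry described in this section can be sketched as a loop that carries verbal lessons forward between attempts. This is an illustration of the idea rather than the paper's implementation: the `attempt`, `evaluate` and `reflect` functions are hypothetical stand-ins for model calls and a task checker.

```python
from typing import Callable, Optional


def reflexion_loop(attempt: Callable[[str, list[str]], str],
                   evaluate: Callable[[str], bool],
                   reflect: Callable[[str], str],
                   task: str, max_trials: int = 3) -> tuple[Optional[str], list[str]]:
    """Retry a task, feeding earlier self-critiques back in; no weights change."""
    lessons: list[str] = []                         # episodic buffer of verbal lessons
    for _ in range(max_trials):
        output = attempt(task, lessons)
        if evaluate(output):
            return output, lessons
        lessons.append(reflect(output))             # store the critique for the next try
    return None, lessons


def attempt(task: str, lessons: list[str]) -> str:
    """Stand-in actor: changes behaviour only once a lesson mentions ISO dates."""
    return "2024-03-31" if any("ISO" in lesson for lesson in lessons) else "31/03/2024"


def evaluate(output: str) -> bool:
    """Stand-in checker: accepts only YYYY-MM-DD shaped strings."""
    return len(output) == 10 and output[4] == "-"


def reflect(output: str) -> str:
    """Stand-in self-critique, written in plain language."""
    return f"'{output}' was rejected; use ISO 8601 dates (YYYY-MM-DD)."


result, lessons = reflexion_loop(attempt, evaluate, reflect, "Format the settlement date")
print(result)
print(lessons)
```

The improvement on the second try comes entirely from the text in `lessons`; the stand-in "model" itself is unchanged, which is the distinction between this and weight-updating approaches.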

The frontier: long-horizon autonomy

The single most useful number for an executive is how long a task an agent can finish on its own.

METR measures the time horizon of agents (the length of task, measured in human working time, that they complete at 50% reliability) and finds it has been doubling roughly every seven months, passing the one-hour mark in early 2025. Extrapolate that curve and the question stops being "can an agent do this task" and becomes "which desk is short enough to hand over next."
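The extrapolation is plain exponential arithmetic, sketched here in Python. The two constants are taken from the figures quoted above; treating the trend as a clean exponential that continues unchanged is an assumption of this sketch, not a forecast.

```python
import math

DOUBLING_MONTHS = 7.0   # "roughly every seven months"
BASE_HOURS = 1.0        # "the one-hour mark in early 2025"


def projected_horizon_hours(months_after_base: float) -> float:
    """Task length (in human hours) at 50% reliability, if the trend simply holds."""
    return BASE_HOURS * 2 ** (months_after_base / DOUBLING_MONTHS)


def months_until(target_hours: float) -> float:
    """Months after the base point at which the curve reaches a target length."""
    return DOUBLING_MONTHS * math.log2(target_hours / BASE_HOURS)


print(projected_horizon_hours(14))   # two doublings after the base point
print(months_until(8))               # months until an eight-hour task, on this curve
```

On these assumptions a full working day of task length is three doublings, or about 21 months, past the one-hour mark. Note that the measure is a 50% success rate, so reaching a given horizon is not the same as completing such tasks reliably.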

The named hard problems that remain (continual learning without catastrophic forgetting, world models, architectural efficiency) are exactly what the reflection work above is reaching toward.