Q: Memory

Question 1

Memory

Accepted Answer

Practitioners borrow from cognitive science: working memory (what's in the context window right now), and long-term memory split into episodic (past interactions), semantic (consolidated facts and policy) and procedural (learned playbooks).

In production this looks like Anthropic's file-based memory tool paired with context editing and compaction to fight "context rot"; Letta's OS-style tiered memory (core / recall / archival); and mem0-style hybrids that combine a vector store with a knowledge graph so an agent can do multi-hop reasoning over what it knows.
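To make the taxonomy concrete, here is a minimal sketch of the four memory types as one data structure. The class name, the field names and the three-message working limit are illustrative assumptions, not any vendor's API; a real system would budget by tokens and back the long-term tiers with files, a vector store or a graph.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical container for the four memory types named above."""
    working: list[str] = field(default_factory=list)        # current context window
    episodic: list[str] = field(default_factory=list)       # past interactions
    semantic: dict[str, str] = field(default_factory=dict)  # consolidated facts
    procedural: dict[str, list[str]] = field(default_factory=dict)  # playbooks
    working_limit: int = 3  # illustrative cap standing in for a token budget

    def observe(self, message: str) -> None:
        """Add to working memory; the oldest overflow moves to episodic memory."""
        self.working.append(message)
        while len(self.working) > self.working_limit:
            self.episodic.append(self.working.pop(0))

    def consolidate(self, key: str, fact: str) -> None:
        """Promote a durable fact into semantic memory."""
        self.semantic[key] = fact


memory = AgentMemory()
for turn in ["hello", "my account is 123", "I want a loan", "for a car"]:
    memory.observe(turn)
memory.consolidate("customer_goal", "car loan")
```

After the four turns, the oldest message has been evicted from working memory into the episodic list, which is the same move compaction makes at larger scale.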

Question 2

Harnesses & the agent loop

Accepted Answer

An agent harness wraps the model with tool execution, context and memory management, state persistence, error handling and guardrails. The loop itself is small: gather context → take action → verify results, with variants going by names such as ReAct or OODA. The capability and the safety live in the harness. Anthropic's Managed Agents is a clean reference: it separates the session (an append-only event log that lives outside the context window and supports rewind/resume), the harness (the loop calling the model and routing tool calls), and the sandbox (where tools actually execute). Decouple the brain from the hands.
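A rough sketch of that session / harness / sandbox split, under stated assumptions: `stub_model`, `sandbox_execute` and the `lookup_balance` tool are hypothetical stand-ins invented for illustration, and the event shapes are not taken from any real product.

```python
from typing import Callable


def sandbox_execute(tool: str, arg: str) -> str:
    """Hypothetical sandbox: the only place tools actually run."""
    tools: dict[str, Callable[[str], str]] = {
        "lookup_balance": lambda account: "balance=100",
    }
    if tool not in tools:
        raise KeyError(f"unknown tool: {tool}")
    return tools[tool](arg)


def stub_model(events: list[dict]) -> dict:
    """Stand-in for a model call: requests one tool, then finishes."""
    if not any(e["type"] == "tool_result" for e in events):
        return {"type": "tool_call", "tool": "lookup_balance", "arg": "123"}
    return {"type": "final", "text": "Balance is 100."}


def run_harness(task: str, max_steps: int = 5) -> list[dict]:
    """The harness: calls the model, routes tool calls, appends to the session."""
    session: list[dict] = [{"type": "task", "text": task}]  # append-only log
    for _ in range(max_steps):  # guardrail: the loop is bounded
        decision = stub_model(session)
        session.append(decision)
        if decision["type"] == "final":
            break
        try:
            result = sandbox_execute(decision["tool"], decision["arg"])
        except KeyError as err:  # error handling lives in the harness
            result = f"error: {err}"
        session.append({"type": "tool_result", "text": result})
    return session


log = run_harness("What is the balance on account 123?")
```

Because the session is only ever appended to, resuming or rewinding amounts to replaying a prefix of `log` into a fresh harness.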

Question 3

Orchestration & multi-agent

Accepted Answer

A lead/orchestrator agent plans and spawns several specialised workers in parallel, each with a fresh context window and a scoped prompt, then merges their results. Other shapes: routers that triage and dispatch, pipelines, swarms, and agent-as-judge evaluators. Agents coordinate over two core protocols: MCP (Model Context Protocol) connects an agent to tools, and A2A (Agent2Agent) lets agents delegate to one another. A third, AP2, extends A2A for agent-initiated payments, which is directly relevant to a bank. MCP and A2A are now stewarded under the Linux Foundation.
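The lead/worker shape can be sketched in a few lines. This is an illustration only: `worker` is a stub where a real system would start a model call with its own context window, and the task names are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(scoped_prompt: str) -> str:
    """Stand-in for a specialised worker that starts from a fresh context."""
    return f"done: {scoped_prompt}"


def orchestrate(task: str, subtasks: list[str]) -> str:
    """Fan out one scoped prompt per subtask, then merge the results."""
    prompts = [f"{task} / {sub}" for sub in subtasks]
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        results = list(pool.map(worker, prompts))  # map preserves input order
    return "\n".join(results)


report = orchestrate("KYC review", ["sanctions check", "document check"])
```

Each worker sees only its own scoped prompt, so the lead's context holds the merged summaries and not the workers' intermediate steps.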

Question 4

Tool use & MCP

Accepted Answer

Beyond plain tool-calling, two recent moves matter. Computer use lets an agent drive a GUI like a person, which is how it reaches the legacy systems a bank can't replace overnight. Code execution with MCP has the model write code that calls MCP tools as APIs in a sandbox, loading tools on demand and filtering data before it ever reaches the context window (a reported ~98% token reduction on large tool sets).
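The filtering idea is easiest to see in the kind of code a model might write inside the sandbox. In this sketch `fetch_transactions` is a hypothetical wrapper that returns synthetic rows; a real wrapper would call an MCP server, and the threshold is arbitrary.

```python
def fetch_transactions() -> list[dict]:
    """Hypothetical tool wrapper; a real one would call an MCP server."""
    return [{"id": i, "amount": i * 100} for i in range(1, 1001)]


def large_transactions(threshold: int) -> list[dict]:
    """Filter inside the sandbox so only matching rows reach the model."""
    rows = fetch_transactions()  # 1,000 rows stay in the sandbox
    return [row for row in rows if row["amount"] > threshold]


summary = large_transactions(threshold=99_000)
```

Only the handful of rows in `summary` would be returned to the context window; the other rows never cost a token.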

Question 5

Evals, guardrails & observability

Accepted Answer

Automated evaluation has moved from LLM-as-judge to agent-as-judge: an agent with tools and memory evaluating another agent's whole action chain, not just its final answer. Production systems add gold-set precision/recall, champion/challenger testing before any change goes live, real-time guardrails that block or escalate, drift detection, and immutable audit logs. Observability is converging on the OpenTelemetry GenAI semantic conventions (span-level tracing of every agent, tool and model call), so banks avoid vendor lock-in and keep an exam-ready record.
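As a small worked example of the gold-set check, the sketch below scores an agent's flagged items against a labelled gold set. The transaction IDs are invented for illustration.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision = TP / predicted, recall = TP / gold (0.0 if the set is empty)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold_flags = {"txn1", "txn2", "txn3", "txn4"}  # human-labelled suspicious items
agent_flags = {"txn1", "txn2", "txn9"}         # what the agent flagged
precision, recall = precision_recall(agent_flags, gold_flags)
```

Here two of the agent's three flags are correct (precision 2/3) and it found two of the four gold items (recall 1/2). A champion/challenger gate would compare these numbers for the current and the candidate agent on the same gold set.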

Question 6

Offline reflection

Accepted Answer

Between runs, an agent can replay what happened and turn it into lessons. Reflexion stores verbal self-critique in an episodic buffer so the agent improves on retry without any weight update. SEAL (MIT, 2025) is the frontier signal: a model that generates its own "self-edits", applies them as updates to its own weights, and learns to write better self-edits via reinforcement learning.
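A toy sketch of a Reflexion-style retry loop, with loud caveats: `attempt`, `evaluate` and `reflect` are hypothetical stubs, where a real system would use model calls for the actor and the self-critique. The point is only that the lesson is stored as text and fed back in, with no weight update.

```python
def attempt(task: str, reflections: list[str]) -> str:
    """Stub actor: succeeds only once a relevant lesson is in the buffer."""
    if any("check units" in note for note in reflections):
        return "42 kg"
    return "42"


def evaluate(answer: str) -> bool:
    """Stub evaluator: the answer must carry a unit."""
    return answer.endswith("kg")


def reflect(task: str, answer: str) -> str:
    """Stub self-critique; a real system would ask the model to write this."""
    return f"Answer '{answer}' lacked a unit; check units next time."


def run_with_reflection(task: str, max_trials: int = 3) -> tuple[str, int, list[str]]:
    reflections: list[str] = []  # episodic buffer of verbal lessons
    answer = ""
    for trial in range(1, max_trials + 1):
        answer = attempt(task, reflections)
        if evaluate(answer):
            return answer, trial, reflections
        reflections.append(reflect(task, answer))
    return answer, max_trials, reflections


answer, trials, buffer = run_with_reflection("How heavy is the parcel?")
```

The first trial fails, its critique lands in the buffer, and the second trial succeeds because the actor reads that critique.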

This is real research, and a credible near-term capability for consolidating a fleet's experience overnight. It is not a shipped product you flip on, and anyone who tells you otherwise is selling something.

Question 7

The frontier: long-horizon autonomy

Accepted Answer

METR measures the time horizon of agents (the length of task, measured in human working time, that they complete at 50% reliability) and finds it has been doubling roughly every seven months, passing the one-hour mark in early 2025. Extrapolate that curve and the question stops being "can an agent do this task" and becomes "which desk is short enough to hand over next." The named hard problems that remain (continual learning without catastrophic forgetting, world models, architectural efficiency) are exactly what the reflection work above is reaching toward.
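For a sense of scale, the extrapolation is plain exponential growth. The function below is a naive projection that assumes the seven-month doubling simply continues from a one-hour starting point; both defaults come from the figures quoted above, and nothing guarantees the trend holds.

```python
def projected_horizon_hours(
    months_ahead: float,
    start_hours: float = 1.0,      # roughly one hour in early 2025, per the text
    doubling_months: float = 7.0,  # assumed constant doubling time
) -> float:
    """Naive projection: horizon = start * 2 ** (months / doubling time)."""
    return start_hours * 2 ** (months_ahead / doubling_months)


after_14_months = projected_horizon_hours(14)  # two doublings
after_21_months = projected_horizon_hours(21)  # three doublings
```

On those assumptions, two doublings give a four-hour horizon and three give a full eight-hour working day, which is why the curve matters for planning.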

The patterns behind the agents