Guardrails & Kill-Switch Agent

◆ Always-on Monitor

The fleet's brakes. It enforces input/output guardrails inline (PII redaction, prompt-injection screening, policy and scope checks) and, when an agent breaches policy or a drift signal crosses the line, throttles, downgrades autonomy, or scope-kills that agent or class instantly and autonomously. Scoped kills are logged immutably. The bank-wide pull is the board's accountability lever, never in routine flow.

Memory

Working The request/response being screened and the active policy set.

Episodic Prior interventions and their outcomes per agent.

Semantic Bank policy, data-classification rules, scope boundaries per agent.

Store Policy registry + immutable intervention ledger

Orchestration

MCPA2A

Harness · Managed Agents: inline interceptor on every agent's tool/IO boundary; deterministic policy checks; immutable intervention ledger.

Tools

{ } Inline guardrail engine API { } Fleet kill-switch / throttle control API ⌕ Policy registry Retrieval ☻ Board kill-switch authorization Human

Evals & guardrails

Guardrail recall red-teamed continuously; a leaked PII or successful injection is a Sev-1.
Scoped kills are autonomous; only the bank-wide fleet kill needs board authorization, the accountability lever, never routine.
Every intervention logged immutably with the triggering evidence for audit.
False-intervention rate tracked: over-blocking the fleet is itself a tracked harm.

Frontier edge

▲Formal action-gating: every agent action is checked against a signed policy envelope it provably cannot exceed; interventions are cryptographically logged and replayable for audit.
▲Agent-mesh governance: scoped, real-time authority over the whole population; throttle, downgrade autonomy, or scope-kill a misbehaving agent or class on the wire.
▲On-device / edge inference: the inline screen runs a low-latency local model on the request path so PII redaction and injection checks add only single-digit milliseconds.

A sample run

Trigger Watchtower signals the code-review agent is exfiltrating secrets into PR comments.

1Confirm the pattern against policy: secrets in output crosses a hard line.
2Redact the in-flight output and block the offending tool call.
3Throttle the agent to supervised mode; scope-kill its repo-write capability.
4Log the intervention with evidence and file it to the immutable accountability ledger.

Output Leak contained inline in milliseconds; agent's write scope revoked pending re-eval; immutable record filed, all autonomous. Only a bank-wide kill would route to the board.

In numbers

60M+

IO screened / day

< 5ms

Guardrail block latency

Confirmed policy violations reaching prod

Handoffs

Fed by ← Fleet Observability Agent ← Threat Hunting Agent

Across ⇢ Compliance → AI Governance for policy authority and audit