How to control and monitor AI agents in production

A practical guide to monitoring, routing, escalation, audit trails, and execution control for production AI agents.

Production AI operations need both monitoring and control. One tells you what happened, the other decides what can happen next.

Key takeaways

Monitoring tells you what happened. Control decides what happens next. You need both.
Named ownership is the single most important operating decision: every gated action has a named human accountable for it.
Role-based routing beats "first to click" every time. It also produces a much cleaner audit trail.
Every approval needs a deadline, an escalation target, and a defined terminal state on timeout.
The business context on each approval is what turns a 10-minute decision into a 2-minute decision.

Monitoring vs control

Monitoring tells you what the agent did, how long it took, and where it failed. It is a read-only view of the past - essential, but not sufficient.

Control decides whether a risky action should run, who can approve it, and how exceptions are escalated. It is a write operation on the present - and it is the layer that actually prevents incidents.

Metrics that matter

Approval rate by workflow
Escalation rate by owner or role
Timeout and expired request rate
Decision latency - P50 and P95
Callback delivery success
Audit trail completeness (every action traceable to a named human)
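
As a rough illustration, the first few metrics above can be computed from a list of decision records. This is a minimal sketch, not a prescribed schema: the `Decision` type, its field names, and the outcome labels are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Decision:
    # Hypothetical record shape; adapt the fields to your own approval store.
    workflow: str
    outcome: str      # "approved", "rejected", or "expired"
    latency_s: float  # seconds from request creation to decision or expiry


def approval_rate_by_workflow(decisions):
    """Return {workflow: fraction of requests that were approved}."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for d in decisions:
        totals[d.workflow] += 1
        if d.outcome == "approved":
            approved[d.workflow] += 1
    return {w: approved[w] / totals[w] for w in totals}


def expired_rate(decisions):
    """Fraction of all requests that timed out without a decision."""
    return sum(1 for d in decisions if d.outcome == "expired") / len(decisions)


def latency_percentiles(decisions):
    """Return P50 and P95 decision latency (needs at least two records)."""
    # quantiles(n=100) yields 99 cut points: index 49 is P50, index 94 is P95.
    cuts = quantiles([d.latency_s for d in decisions], n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Reporting P95 alongside P50 matters because a healthy median can hide a long tail of requests that sit unanswered.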

What an operating model needs

Named ownership for every gated action type
Role-based routing - refunds to CS, compensation to HR, ledger writes to finance
Clear reject and fallback paths, so the agent has a defined next step when the answer is no
SLA targets per action class with escalation rules when missed
Audit logs that tie decisions back to business context, not just technical ids
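
One way to make these requirements concrete is a routing table keyed by action type. The sketch below is an assumption-laden example, not a reference design: the action names, role names, SLA values, and reject behaviors are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner_role: str   # role that owns the approval, not an individual inbox
    sla_minutes: int  # target time to decision before escalation starts
    on_reject: str    # what the agent does when the reviewer says no


# Illustrative entries only; every name and number here is made up.
ROUTES = {
    "refund": Route("cs_lead", 15, "notify_customer_of_manual_review"),
    "compensation_change": Route("hr_lead", 15, "leave_record_unchanged"),
    "ledger_write": Route("finance_lead", 10, "queue_for_manual_entry"),
}


def route_for(action_type: str) -> Route:
    """Look up the owner for a gated action; refuse actions nobody owns."""
    try:
        return ROUTES[action_type]
    except KeyError:
        raise LookupError(f"no named owner for action type {action_type!r}") from None
```

Failing loudly on an unknown action type is deliberate in this sketch: an action with no owner should not be gated by whoever happens to see it first.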

How to design the escalation tree

A good escalation tree is short and named. Two levels are usually enough: a primary owner (CS lead, finance lead) with a short deadline, then an accountable fallback (manager, on-call) with a longer one. If you need more than that, you are describing an organizational problem, not a technical one.

Level 1: role-based primary owner, 5-15 minute deadline.
Level 2: named fallback, 15-60 minute deadline.
Level 3: terminal state (reject, expire, or fail to a safe default).
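
The three levels above can be sketched as a small data structure that, given the time elapsed since the request was created, says who currently holds it. The class names, assignee labels, and the `pending:` / `terminal:` string format are assumptions for illustration; the deadlines are picked from the ranges above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Level:
    assignee: str        # role or named person holding the request
    deadline: timedelta  # how long this level has before escalation


@dataclass(frozen=True)
class EscalationTree:
    levels: tuple        # ordered Level entries, primary owner first
    terminal_state: str  # "reject", "expire", or a safe default

    def current_state(self, elapsed: timedelta) -> str:
        """Return who holds the request after `elapsed`, or the terminal state."""
        remaining = elapsed
        for level in self.levels:
            if remaining < level.deadline:
                return f"pending:{level.assignee}"
            remaining -= level.deadline  # this level's window has passed
        return f"terminal:{self.terminal_state}"


# Hypothetical refund tree: 10 minutes for the CS lead, then 30 for the on-call manager.
refund_tree = EscalationTree(
    levels=(
        Level("cs_lead", timedelta(minutes=10)),
        Level("on_call_manager", timedelta(minutes=30)),
    ),
    terminal_state="reject",
)
```

Because the terminal state is a required field, a tree cannot be defined in this sketch without saying what happens when every deadline is missed.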

What to put in the audit trail

The request id, workflow id, and run id
The reviewer identity and decision timestamp
The business object - order id, account id, invoice id
The action in human language
The comment (approval or rejection reason)
Callback delivery status
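
A minimal sketch of one audit record carrying the fields listed above, serialized as a single JSON log line. The field names and example values are hypothetical; the only point is that the business object and a human-readable action sit next to the technical ids.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    # Hypothetical field names mirroring the list above.
    request_id: str
    workflow_id: str
    run_id: str
    reviewer: str          # identity of the named human who decided
    decided_at: str        # ISO 8601 timestamp in UTC
    business_object: dict  # e.g. {"type": "order", "id": "..."}
    action: str            # the action in human language
    decision: str          # "approved" or "rejected"
    comment: str           # approval or rejection reason
    callback_status: str   # e.g. "delivered" or "failed"


def to_log_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


example = AuditRecord(
    request_id="req-001",
    workflow_id="refunds",
    run_id="run-042",
    reviewer="jane.doe@example.com",
    decided_at="2024-01-01T12:00:00+00:00",
    business_object={"type": "order", "id": "ORD-1001"},
    action="Refund 40.00 USD to the customer for order ORD-1001",
    decision="approved",
    comment="Duplicate charge confirmed",
    callback_status="delivered",
)
```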

The day-two operating question

Once the system is live, the ongoing question is not "did the agent behave?" It is "is the approval model still calibrated?" Too many approval requests burn out reviewers; too few mean risky actions are running unreviewed, which is an incident you have not yet detected. Review the approval-by-workflow and escalation-by-role metrics monthly and adjust thresholds accordingly.

Frequently asked questions

What is the difference between agent observability and agent control?

Observability helps you inspect behavior. Control decides how risky actions are allowed, approved, rejected, or escalated.

Who should own agent operations?

Shared ownership. Platform teams own the infrastructure; business domain leads (CS, finance, HR) own the approval policy and escalation rules for their workflows.

How do I measure whether oversight is working?

Track approval and escalation rates by workflow, decision latency, and incident rate. If requests are always approved with no comment, oversight has probably become a rubber stamp, and the gate needs tightening.
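
The rubber-stamp signal can be approximated with a simple check over one workflow's recent decisions. This is a heuristic sketch only: the function name, the `(outcome, comment)` input shape, and all three thresholds are assumptions to tune against your own data.

```python
def looks_rubber_stamped(decisions, min_sample=20,
                         max_approval_rate=0.98, min_comment_rate=0.05):
    """Flag a workflow whose reviewers approve nearly everything without comment.

    decisions: list of (outcome, comment) pairs for a single workflow.
    All thresholds are illustrative defaults, not recommended values.
    """
    if len(decisions) < min_sample:
        return False  # too little data to judge either way
    approved = sum(1 for outcome, _ in decisions if outcome == "approved")
    commented = sum(1 for _, comment in decisions if comment.strip())
    approval_rate = approved / len(decisions)
    comment_rate = commented / len(decisions)
    return approval_rate > max_approval_rate and comment_rate < min_comment_rate
```

A flagged workflow is a prompt for a human review of the gate, not proof that oversight has failed.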

What should happen on approval timeout?

Either escalate to a named fallback or land in a safe terminal state (reject or expire). Indefinite waits are among the most common operational bugs.

How does Contro1 fit?

Contro1 is the approval and escalation layer that implements named ownership, role-based routing, deadlines, escalation trees, and audit trails so you do not rebuild them per framework.