How to control and monitor AI agents in production
A practical guide to monitoring, routing, escalation, audit trails, and execution control for production AI agents.
Production AI operations need both monitoring and control. One tells you what happened, the other decides what can happen next.
Key takeaways
- Monitoring tells you what happened. Control decides what happens next. You need both.
- Named ownership is the single most important operating decision: every gated action has a named human accountable for it.
- Role-based routing beats "first to click" every time. It also produces a much cleaner audit trail.
- Every approval needs a deadline, an escalation target, and a defined terminal state on timeout.
- The business context on each approval is what turns a 10-minute decision into a 2-minute decision.
Monitoring vs control
Monitoring tells you what the agent did, how long it took, and where it failed. It is a read-only view of the past - essential, but not sufficient.
Control decides whether a risky action should run, who can approve it, and how exceptions are escalated. It is a write operation on the present - and it is the layer that actually prevents incidents.
Metrics that matter
- Approval rate by workflow
- Escalation rate by owner or role
- Timeout and expired request rate
- Decision latency - P50 and P95
- Callback delivery success
- Audit trail completeness (every action traceable to a named human)
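As a rough illustration, several of these metrics can be computed from a list of decision records. The record fields (`workflow`, `status`, `escalated`, `requested_at`, `decided_at`) and the status values are assumptions made for this sketch, not a schema from any particular tool.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Optional


@dataclass
class Decision:
    workflow: str
    status: str                  # assumed values: "approved", "rejected", "expired"
    escalated: bool              # True if the request left its primary owner
    requested_at: float          # seconds since epoch
    decided_at: Optional[float]  # None when no human ever decided


def summarize(decisions: list[Decision]) -> dict[str, float]:
    """Compute approval, escalation, and expiry rates plus P50/P95 decision latency."""
    total = len(decisions)
    if total == 0:
        return {}
    summary = {
        "approval_rate": sum(d.status == "approved" for d in decisions) / total,
        "escalation_rate": sum(d.escalated for d in decisions) / total,
        "expired_rate": sum(d.status == "expired" for d in decisions) / total,
    }
    # Latency only exists for requests that a human actually decided.
    latencies = [d.decided_at - d.requested_at for d in decisions if d.decided_at is not None]
    if len(latencies) >= 2:
        cuts = quantiles(latencies, n=100, method="inclusive")  # 99 percentile cut points
        summary["latency_p50"] = cuts[49]
        summary["latency_p95"] = cuts[94]
    return summary
```

With `method="inclusive"` the sample is treated as the full population, so the reported percentiles stay within the observed range. Grouping the same records by workflow or by role before calling `summarize` gives the per-workflow and per-role views.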
What an operating model needs
- Named ownership for every gated action type
- Role-based routing - refunds to CS, compensation to HR, ledger writes to finance
- Clear reject and fallback paths, so the agent has a defined next step when the answer is no
- SLA targets per action class with escalation rules when missed
- Audit logs that tie decisions back to business context, not just technical IDs
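One way to make these requirements concrete is a routing table keyed by action type. The sketch below is a minimal example under stated assumptions: the action types, role names, SLA values, and reject behaviors are all hypothetical placeholders that follow the examples in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    owner_role: str    # role that approves first
    fallback: str      # named fallback when the SLA is missed
    sla_minutes: int   # target time to a decision
    on_reject: str     # what the agent does when the answer is no


# Hypothetical action types, roles, and SLA values, for illustration only.
ROUTES: dict[str, Route] = {
    "refund": Route("cs_lead", "cs_manager", 15, "notify_customer_and_close"),
    "compensation_change": Route("hr_lead", "hr_manager", 60, "return_to_requester"),
    "ledger_write": Route("finance_lead", "finance_on_call", 30, "discard_draft_entry"),
}


def route_for(action_type: str) -> Route:
    """Return the route for a gated action, refusing anything without a named owner."""
    try:
        return ROUTES[action_type]
    except KeyError:
        raise LookupError(f"no owner configured for action type {action_type!r}") from None
```

Failing loudly on an unknown action type is a deliberate choice in this sketch: an action with no configured owner should not run by default.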
How to design the escalation tree
A good escalation tree is short and named. Two human levels are usually enough: a primary owner (CS lead, finance lead) with a short deadline, then an accountable fallback (manager, on-call) with a longer one, followed by a terminal state. If you need more levels than that, you are describing an organizational problem, not a technical one.
- Level 1: role-based primary owner, 5-15 minute deadline.
- Level 2: named fallback, 15-60 minute deadline.
- Level 3: terminal state (reject, expire, or fail to a safe default).
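The three levels above can be sketched as a small function that maps how long a request has been waiting to whoever currently holds it. The assignee names are placeholders, the deadlines are picked from within the ranges above, and the choice of "rejected" as the terminal state is an assumption; expire or a safe default would work the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    assignee: str
    deadline_minutes: int  # time this level has before the request moves on


# Placeholder assignees; deadlines chosen from the 5-15 and 15-60 minute ranges.
TREE: tuple[Level, ...] = (
    Level("cs_lead", 10),
    Level("cs_manager", 30),
)
TERMINAL_STATE = "rejected"  # assumed safe default once every level has timed out


def current_state(minutes_waiting: float, tree: tuple[Level, ...] = TREE) -> str:
    """Return who holds an undecided request, or the terminal state on timeout."""
    remaining = minutes_waiting
    for level in tree:
        if remaining < level.deadline_minutes:
            return f"pending:{level.assignee}"
        remaining -= level.deadline_minutes  # this level timed out; move to the next
    return TERMINAL_STATE
```

For example, `current_state(5)` is still with the primary owner, `current_state(25)` has moved to the fallback, and anything at or beyond 40 minutes lands in the terminal state, so no request waits indefinitely.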
What to put in the audit trail
- The request ID, workflow ID, and run ID
- The reviewer identity and decision timestamp
- The business object - order ID, account ID, invoice ID
- The action in human language
- The comment (approval or rejection reason)
- Callback delivery status
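A minimal sketch of such a record, assuming one flat entry per decision. The field names and the example values in the comments are hypothetical; the point is that each item in the list above becomes a required field, which makes completeness something you can check.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    workflow_id: str
    run_id: str
    reviewer: str         # a named human, not a service account
    decided_at: datetime
    business_object: str  # hypothetical format, e.g. "order:48213"
    action: str           # the action in human language
    comment: str          # approval or rejection reason
    callback_status: str  # assumed values, e.g. "delivered", "failed", "pending"


def missing_fields(record: AuditRecord) -> list[str]:
    """List fields left empty, which is what an audit-completeness metric would count."""
    return [f.name for f in fields(record) if getattr(record, f.name) in ("", None)]
```

A record where `missing_fields` returns an empty list is traceable end to end; one that returns `["comment"]` was decided without a stated reason.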
The day-two operating question
Once the system is live, the ongoing question is not "did the agent behave?" It is "is the approval model still calibrated?" Too many approval requests burn reviewers out; too few suggests that risky actions are running ungated, which may mean an incident you have not detected yet. Review the approval-by-workflow and escalation-by-role metrics monthly and adjust the approval thresholds accordingly.
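A monthly review of this kind could start from something like the sketch below, which groups a month of decisions by workflow and flags the ones that look miscalibrated. Both thresholds and the input shape are assumptions for illustration; the right values depend on the workflow, and the same grouping can be done by role.

```python
from collections import defaultdict

# Assumed review thresholds; tune them per workflow.
RUBBER_STAMP_APPROVAL_RATE = 0.98
HIGH_ESCALATION_RATE = 0.20


def flag_workflows(decisions: list[dict]) -> dict[str, list[str]]:
    """Flag workflows whose approval or escalation rate suggests a recalibration.

    Each decision is a dict with "workflow" (str), "approved" (bool), "escalated" (bool).
    """
    by_workflow: dict[str, list[dict]] = defaultdict(list)
    for d in decisions:
        by_workflow[d["workflow"]].append(d)

    flags: dict[str, list[str]] = {}
    for workflow, rows in by_workflow.items():
        approval_rate = sum(r["approved"] for r in rows) / len(rows)
        escalation_rate = sum(r["escalated"] for r in rows) / len(rows)
        reasons = []
        if approval_rate >= RUBBER_STAMP_APPROVAL_RATE:
            reasons.append("approval rate near 100%")
        if escalation_rate >= HIGH_ESCALATION_RATE:
            reasons.append("high escalation rate")
        if reasons:
            flags[workflow] = reasons
    return flags
```

A flag is a prompt for a human review of the gate, not an automatic threshold change.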
Frequently asked questions
What is the difference between agent observability and agent control?
Observability helps you inspect behavior. Control decides how risky actions are allowed, approved, rejected, or escalated.
Who should own agent operations?
Shared ownership. Platform teams own the infrastructure; business domain leads (CS, finance, HR) own the approval policy and escalation rules for their workflows.
How do I measure whether oversight is working?
Track approval and escalation rates by workflow, decision latency, and incident rate. If every request is approved with no comment, oversight is probably a rubber stamp: tighten the gate.
What should happen on approval timeout?
Either escalate to a named fallback or land in a safe terminal state (reject or expire). Indefinite waits are a common operational bug.
How does Contro1 fit?
Contro1 is the approval and escalation layer that implements named ownership, role-based routing, deadlines, escalation trees, and audit trails so you do not rebuild them per framework.