Operations

AI agent approvals and escalations

Design approval workflows, timeout handling, fallback reviewers, SLA escalation, and signed callback paths for production AI agent systems.

Approvals are only half the system. Escalations, ownership, and timeout handling are what make approvals operationally reliable.

Key takeaways

  • An approval without escalation is a hang waiting to happen.
  • Ownership is named by role, not person - people go on vacation, roles do not.
  • Every approval needs a deadline and a defined terminal state on timeout.
  • Callback delivery is where most integrations silently fail - sign it and verify it.
  • The same policy works across frameworks when it is expressed as rules, not code.

What a complete approval system needs

  • A clear owner or required role
  • SLA targets per action class
  • Escalation levels with named fallbacks
  • Signed callback delivery with replay protection
  • A safe terminal outcome if nobody responds
  • An audit record that ties decisions back to business context

Why escalation design matters

If the right person does not respond, the system needs to know who is next, how long to wait, and whether the workflow should fail closed or fail safe. A request with no escalation path is an operational liability - the customer is waiting, the agent is paused, and nobody is notified that the clock is ticking.

Designing deadlines

Deadlines express business urgency. A refund-review deadline of 15 minutes tells the agent "do not keep the customer waiting past this point." A compliance review deadline of 4 hours tells the system "this can wait for business hours, but not longer."

  • Urgent customer-facing: 5-15 minutes.
  • Normal internal decisions: 30 minutes to 2 hours.
  • Compliance or cross-team reviews: 4-24 hours.
  • Anything longer than a day should probably not be automated at all.

Callback safety

The callback is the path that resumes your workflow after a human decides. It is also the most common place for subtle bugs - a misconfigured proxy, a retried webhook, or a developer hitting the resume endpoint by hand.

  • Sign every callback with HMAC and verify on receipt.
  • Reject callbacks older than your clock skew window.
  • Make your resume endpoint idempotent - same request id, same outcome.
  • Log every callback attempt with its delivery status.

Webhooks reference

Cross-functional example

A finance agent can route an invoice discrepancy to a finance manager first with a 30-minute deadline. On timeout, it escalates to the CFO with a 2-hour deadline. On second timeout, the workflow expires - the agent takes no action, a ticket opens for manual review, and the customer is told the case is under review.

The policy that encodes all of this lives in one place. The agent that raised the request does not need to know about escalation - it only needs to know how to pause and how to resume.

What goes wrong in production

  • Approvals sit in a shared channel, nobody is the named owner - first to click wins, quality collapses.
  • No deadline - requests pile up, the workflow appears "stuck."
  • Callback endpoint is unauthenticated - retries and replays cause double actions.
  • Escalation hits the same inbox as the original request - nobody new sees it.
  • Audit trail stores technical ids but not business context - compliance reviews go badly.

Frequently asked questions

What happens if nobody responds to an approval request?

A production-ready system should escalate according to SLA and then end in a safe explicit state such as expired or denied.

Who should own AI agent approvals?

The person or role that already owns the business decision, not the person who happened to deploy the agent.

How many escalation levels is enough?

Two is usually enough: a role-based primary owner and a named fallback. Beyond that you are describing an organizational problem, not a technical one.

Should callbacks retry?

Yes, with exponential backoff and signed payloads. Make the receiver idempotent so retries are safe.

How do I handle a rejection?

Define a fallback path in the workflow. The agent should have something deterministic to do on no - log the reason, notify the requester, and end the run cleanly.