Vaultak
Engineering

How to Monitor AI Agents in Production

April 15, 2026 · 8 min read · By Samuel Oladji

Here is a pattern that plays out more often than it should. A team ships an AI agent. It runs fine for a few days. Then something goes sideways: an unusual sequence of actions, a record that should not have been touched, an email that should not have been sent. The team finds out not from their monitoring stack, but from a customer complaint.

The post-mortem usually uncovers the same gap: the agent had logging, but nobody was watching the right things. They could see that the agent ran. They could not see what it actually did, or whether what it did was within acceptable bounds.

This is the difference between logging and monitoring. Logging tells you what happened. Monitoring tells you when something that happened should not have.

This post is about how to build the second thing for AI agents, practically, without a six-month observability project.

Why standard logging is not enough

If you are using LangChain, CrewAI, or the OpenAI Assistants API, you probably already have some form of logging set up. You can see token counts, latencies, and which tools were called. That is useful for debugging. It is not sufficient for production safety.

The problem is that standard logging is retrospective. You look at it after something goes wrong. What you actually need for an agent running in production is something closer to what a circuit breaker does in electrical systems: something that catches a problem at the moment it is happening, before the damage compounds.

An agent that runs a bulk database update affecting 10,000 records will show up fine in your latency graphs. The action completed successfully. From a logging perspective, everything looks normal. From a business perspective, you just lost a week of work.

What you actually need to monitor

There are five things worth tracking for any agent running in production. Not all of them are obvious.

1. Action type and target

Every action your agent takes should be categorized. Is it a read or a write? Is it touching a production system or a sandboxed one? Is it sending an outbound message? The categorization sounds basic but most teams skip it, which means they have no way to set thresholds or alerts without going back through raw logs.
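
One way to make that categorization concrete is a small lookup that tags each tool call before it runs. The sketch below is illustrative only: the tool names, the three categories, and the "sandbox." naming convention are all assumptions, not part of any framework.

```python
from dataclasses import dataclass
from enum import Enum


class ActionKind(Enum):
    READ = "read"
    WRITE = "write"
    OUTBOUND = "outbound"


# Hypothetical mapping from tool names to categories; adapt to your own tools.
TOOL_KINDS = {
    "search_tickets": ActionKind.READ,
    "update_account": ActionKind.WRITE,
    "send_email": ActionKind.OUTBOUND,
}


@dataclass(frozen=True)
class ClassifiedAction:
    tool: str
    kind: ActionKind
    target: str
    is_production: bool


def classify(tool: str, target: str) -> ClassifiedAction:
    # Unknown tools default to WRITE so they get the stricter treatment.
    kind = TOOL_KINDS.get(tool, ActionKind.WRITE)
    # Assumed naming convention: sandboxed targets start with "sandbox.".
    is_production = not target.startswith("sandbox.")
    return ClassifiedAction(tool, kind, target, is_production)
```

Defaulting unknown tools to the write category is a deliberate choice here: a tool nobody has classified yet should hit the stricter thresholds, not slip past them.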

2. Blast radius per action

This is the one most teams miss entirely. Blast radius measures how many records or systems a single action affects. A write operation that touches one row is very different from a write operation that touches every row matching a condition. Both look like "write operations" in your logs. Only one of them is a potential disaster.

Before any bulk operation executes, you want a count. If that count exceeds a threshold you have defined, the action should either be blocked or require explicit approval.
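
As a rough sketch of that count-then-decide step, the example below uses the standard library's sqlite3 module against an in-memory table. The accounts table, the MAX_ROWS threshold, and the guarded_update helper are assumptions made for illustration.

```python
import sqlite3

MAX_ROWS = 25  # assumed threshold; pick one that fits your data


class BlastRadiusExceeded(Exception):
    pass


def guarded_update(conn, where_sql, params, new_status):
    # where_sql must be a trusted, developer-written fragment with ?
    # placeholders, never text produced by the model.
    # Count the rows the WHERE clause matches before changing anything.
    count = conn.execute(
        f"SELECT COUNT(*) FROM accounts WHERE {where_sql}", params
    ).fetchone()[0]
    if count > MAX_ROWS:
        raise BlastRadiusExceeded(f"{count} rows matched, limit is {MAX_ROWS}")
    conn.execute(
        f"UPDATE accounts SET status = ? WHERE {where_sql}",
        (new_status, *params),
    )
    return count


# Demo data: 100 free accounts and 3 pro accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO accounts (plan, status) VALUES (?, ?)",
    [("free", "active")] * 100 + [("pro", "active")] * 3,
)
```

With this data, an update scoped to the pro plan goes through, while the same update scoped to the free plan raises before any row is changed. In a shared database you would likely run the count and the update inside one transaction so the count cannot go stale between the two statements.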

3. Policy violations

Not every agent action that completes successfully is an action that should have been allowed. You need a layer that checks each action against a set of rules before it runs, rather than logging it afterward. Rules like: no deletes on production tables, no outbound emails to more than 50 recipients, no API calls to domains outside an approved list.

Without this layer, your monitoring is purely observational. You can see what happened but you cannot prevent it from happening again.
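
To show the shape of such a layer, here is a minimal sketch of the three example rules as a plain function that runs before the action does. The action dictionary format, the "production." prefix, and the allowlisted domain are assumptions for illustration.

```python
from urllib.parse import urlparse

APPROVED_DOMAINS = {"api.example.com"}  # hypothetical allowlist
MAX_RECIPIENTS = 50


def check_action(action: dict) -> list[str]:
    """Return the list of violated rules; an empty list means allowed."""
    violations = []
    kind = action.get("type")
    if kind == "delete" and action.get("resource", "").startswith("production."):
        violations.append("no deletes on production tables")
    if kind == "send_email" and len(action.get("recipients", [])) > MAX_RECIPIENTS:
        violations.append("too many email recipients")
    if kind == "api_call":
        # A missing or unparseable URL has no hostname and is rejected too.
        host = urlparse(action.get("url", "")).hostname
        if host not in APPROVED_DOMAINS:
            violations.append("domain not on approved list")
    return violations
```

The caller decides what a non-empty result means: block outright, or hold the action for approval.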

4. Behavioral drift

Agents that run continuously develop patterns. A customer service agent that normally processes 20 to 30 tickets per hour and suddenly starts processing 400 is exhibiting behavioral drift. It might be fine. It might be a runaway loop. Without a baseline and a threshold, you have no way to tell the difference until the damage is done.
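
A baseline and a threshold can be as simple as a sliding-window counter. The sketch below is one possible approach, not a standard algorithm: the RateMonitor class and the tolerance multiplier of 3 are assumptions you would tune against your own traffic.

```python
from collections import deque


class RateMonitor:
    """Flags when actions in a sliding window exceed a multiple of the baseline."""

    def __init__(self, baseline_per_hour: float, tolerance: float = 3.0,
                 window_s: float = 3600.0):
        # Maximum number of actions tolerated inside one window.
        self.limit = baseline_per_hour * tolerance * (window_s / 3600.0)
        self.window_s = window_s
        self.events = deque()

    def record(self, timestamp: float) -> bool:
        """Record one action; return True if the windowed count is above the limit."""
        self.events.append(timestamp)
        # Drop events that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window_s:
            self.events.popleft()
        return len(self.events) > self.limit
```

With a baseline of 30 tickets per hour and the default tolerance, the monitor stays quiet up to 90 actions in an hour and flags the 91st, well before an agent reaches 400.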

5. State before high-risk actions

This is less about monitoring and more about recovery, but they belong together. Before any action that carries meaningful risk, snapshot the relevant state. If the action turns out to have been a mistake, you want a recovery path that takes minutes, not days.
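
The idea can be sketched with an in-memory store that copies records before they change. This is a stand-in for a real backup mechanism: SnapshotStore and its methods are hypothetical, and a production version would persist snapshots somewhere durable.

```python
import copy


class SnapshotStore:
    """In-memory snapshots of records keyed by id."""

    def __init__(self):
        self._snapshots = {}
        self._next_id = 1

    def snapshot(self, table: dict, record_ids) -> int:
        snap_id = self._next_id
        self._next_id += 1
        # Deep-copy so later mutations of live records do not alter the snapshot.
        # An id that does not exist yet is stored as None so rollback removes it.
        self._snapshots[snap_id] = {
            rid: copy.deepcopy(table.get(rid)) for rid in record_ids
        }
        return snap_id

    def rollback(self, table: dict, snap_id: int) -> None:
        for rid, saved in self._snapshots[snap_id].items():
            if saved is None:
                table.pop(rid, None)
            else:
                table[rid] = copy.deepcopy(saved)
```

Rollback restores only the records named in the snapshot, so unrelated changes made in the meantime are left alone.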

A practical monitoring setup

Here is what a minimal but effective production monitoring setup looks like for a LangChain or CrewAI agent.

The first layer is instrumentation at the action level. Every tool call your agent makes should pass through a checkpoint that records the action type, the target resource, the payload size, and the timestamp.

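from vaultak import VaultakClient


class PolicyViolationError(Exception):
    """Application-defined error raised when a check blocks an action."""


client = VaultakClient(api_key="your_api_key")

# affected_records and update_records() stand in for your own application
# code: the rows the agent wants to change and the function that changes them.

# Pre-execution check on every agent action
result = client.check(
    action="update",
    resource="users.accounts",
    payload={"record_count": len(affected_records)}
)

if result.blocked:
    # Action violates a defined policy
    raise PolicyViolationError(result.reason)

# Proceed with action
update_records(affected_records)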
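
If you want to see what a checkpoint records without any external service, a decorator is one way to sketch it. Everything here is hypothetical: the checkpoint decorator, the ACTION_LOG list, and the set_plan tool are illustrative names, not part of any SDK.

```python
import functools
import json
import time

ACTION_LOG = []  # stand-in sink; replace with your log pipeline


def checkpoint(action_type: str, resource: str):
    """Decorator that records one entry per tool call before the tool runs."""
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            # Serialize the arguments only to measure their size.
            payload = json.dumps({"args": args, "kwargs": kwargs}, default=str)
            ACTION_LOG.append({
                "action": action_type,
                "resource": resource,
                "payload_bytes": len(payload.encode("utf-8")),
                "timestamp": time.time(),
                "tool": tool.__name__,
            })
            return tool(*args, **kwargs)
        return wrapper
    return decorator


@checkpoint(action_type="update", resource="users.accounts")
def set_plan(account_id: int, plan: str) -> str:
    return f"account {account_id} -> {plan}"
```

Each call to set_plan now appends one entry with the action type, target resource, payload size, and timestamp before the tool body executes.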

The second layer is policy definition. Start with three policies and expand from there:

# Block bulk operations above a safe threshold
client.policies.create({
    "name": "blast_radius_limit",
    "rule": "block",
    "condition": {
        "action": "update",
        "record_count_exceeds": 25
    }
})

# Block deletes on production resources entirely
client.policies.create({
    "name": "no_production_deletes",
    "rule": "block",
    "condition": {
        "action": "delete",
        "resource_pattern": "production.*"
    }
})

# Alert on high-volume outbound email
client.policies.create({
    "name": "email_volume_alert",
    "rule": "alert",
    "condition": {
        "action": "send_email",
        "recipient_count_exceeds": 50
    }
})

The third layer is rollback. Before any action that touches more than one record, snapshot the affected state.

# Snapshot before a high-risk operation
snapshot_id = client.snapshot(
    resource="users.accounts",
    record_ids=[r.id for r in affected_records]
)

# Only if something goes wrong after execution: restore the snapshot
client.rollback(snapshot_id)

These three layers together give you meaningful production monitoring: you know what the agent is doing, you have defined what it is not allowed to do, and you have a recovery path if something slips through.

Where to send the signals

The monitoring data from your agent should flow into whatever tools your team already uses. Setting up a separate dashboard that nobody checks is worse than nothing, because it creates false confidence.

If your team lives in Slack, route agent alerts to a dedicated channel. If you use PagerDuty for on-call, wire critical policy violations into your existing rotation. If you have Splunk or Datadog, the agent's action stream should be another data source in your existing SIEM or observability setup, not a separate tool.
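
A severity-based router is one simple way to structure that fan-out. In the sketch below the destinations are plain Python lists standing in for real integrations, and the severity names and the make_router helper are assumptions for illustration.

```python
import json
from typing import Callable

Sender = Callable[[str], None]


def make_router(routes: dict[str, list[Sender]]) -> Callable[[dict], int]:
    """Build a function that fans an alert out to the senders for its severity."""
    def route(alert: dict) -> int:
        message = json.dumps(alert, sort_keys=True)
        senders = routes.get(alert.get("severity", "info"), [])
        for send in senders:
            send(message)
        return len(senders)
    return route


# Stand-in destinations; in practice these would post to a chat webhook,
# open an on-call incident, or write to a log shipper.
slack_channel, pager, siem = [], [], []
route = make_router({
    "info": [siem.append],
    "warning": [siem.append, slack_channel.append],
    "critical": [siem.append, slack_channel.append, pager.append],
})
```

Every alert reaches the action stream, while only critical ones page a person, which keeps the on-call rotation from being flooded by routine events.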

The goal is to make agent monitoring as native to your existing workflow as application performance monitoring. It should not require anyone to open a new tab to find out whether the agent behaved appropriately in the last hour.

The monitoring gap most teams discover too late

The thing that surprises most teams when they first instrument an agent properly is how much was happening that they had no visibility into. That is not because the agent was misbehaving; often it was not. But the actions it was taking, the data it was touching, and the volume it was operating at were all invisible until they started looking.

Running an agent in production without action-level monitoring is roughly equivalent to running a web application with no error tracking. You will find out about problems eventually. You will just find out the hard way.

The good news is that the instrumentation is not a large project. Five lines of code and a handful of policy definitions get you most of the way there. The hard part is not the implementation; it is deciding to do it before something goes wrong rather than after.
