The boring parts make AI agents production-ready
We have all been stuck in the AI support loop from hell, where the agent goes in circles and spoon-feeds us the same knowledge base article for the third time.
Support is an obvious use case for agents because the raw material is already there: documentation, policies, previous tickets, and known answers. Put an agent on top of that, and the first demo usually looks impressive.
That's a proof of concept, not a production-ready agent. You need to know whether the answer was correct, whether a changed knowledge base article broke existing flows, and whether the changed instructions improved the agent rather than merely making it sound more confident.
That is why you need a way to evaluate it. Start with a golden dataset: representative questions, expected answers, edge cases, and examples where the agent should refuse or escalate. Without that, every change is mostly based on gut feeling.
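To make that concrete, here is a minimal sketch of a golden-dataset check. Everything in it is a hypothetical stand-in: `GoldenCase`, `AgentReply`, `evaluate`, the toy agent, and the sample questions are invented for illustration, and keyword matching is only a crude proxy for "the answer was correct". A real setup would likely use a stricter grader.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    question: str
    expected_keywords: list[str]   # facts a correct answer must mention
    should_escalate: bool = False  # True when the right move is a handoff


@dataclass
class AgentReply:
    text: str
    escalated: bool


def evaluate(agent: Callable[[str], AgentReply], cases: list[GoldenCase]) -> float:
    """Return the fraction of golden cases the agent passes."""
    if not cases:
        raise ValueError("the golden dataset is empty")
    passed = 0
    for case in cases:
        reply = agent(case.question)
        if case.should_escalate:
            # The only acceptable behaviour here is handing over.
            ok = reply.escalated
        else:
            # The agent must answer, and the answer must contain every expected fact.
            ok = not reply.escalated and all(
                kw.lower() in reply.text.lower() for kw in case.expected_keywords
            )
        passed += int(ok)
    return passed / len(cases)


# Hypothetical toy agent: it only knows about refunds and escalates the rest.
def toy_agent(question: str) -> AgentReply:
    if "refund" in question.lower():
        return AgentReply("Refunds are available within 30 days.", escalated=False)
    return AgentReply("I don't know, handing over to a human.", escalated=True)


# Made-up golden cases: a normal question, a must-escalate case, and a gap.
GOLDEN = [
    GoldenCase("How long do I have to request a refund?", ["30 days"]),
    GoldenCase("Can you delete my account and all backups?", [], should_escalate=True),
    GoldenCase("How do I reset my password?", ["reset link"]),
]

score = evaluate(toy_agent, GOLDEN)
print(f"pass rate: {score:.0%}")
```

Run the same dataset before and after every change to the instructions or the knowledge base, and a drop in the pass rate tells you something broke before a customer does.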
You need the logs to be useful. Not just “the agent answered”, but what the customer asked, what context was retrieved, what answer was sent, and where the behaviour went wrong. If you cannot review the failures, you cannot operate the system.
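One way to capture those fields is a structured record per interaction, written as one JSON line. This is a sketch under assumptions: the `InteractionLog` class, its field names, and the sample values are hypothetical, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class InteractionLog:
    customer_question: str
    retrieved_context: list[str]  # identifiers of the articles the agent saw
    answer_sent: str
    escalated: bool = False
    failure_note: str = ""        # filled in later by a human reviewer
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(record: InteractionLog) -> str:
    """Serialise one interaction as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


# Made-up example values.
record = InteractionLog(
    customer_question="Why was I charged twice?",
    retrieved_context=["kb/billing-refunds"],
    answer_sent="Refunds are available within 30 days.",
    failure_note="Retrieved the refund article instead of the duplicate-charge one.",
)
line = to_json_line(record)
print(line)
```

Keeping the question, the retrieved context, and the answer in the same record is what lets a reviewer see whether a bad answer came from bad retrieval or from bad generation.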
The agent also needs to be able to say “I don't know” and hand the conversation over to a human.
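How the agent decides that it does not know is a design choice. One simple option, sketched here, is to escalate whenever retrieval finds nothing convincing. The `decide` function, the `Decision` type, and the 0.75 threshold are hypothetical, and a retrieval score is only one possible signal.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "answer" or "escalate"
    reason: str


def decide(retrieval_scores: list[float], min_score: float = 0.75) -> Decision:
    """Escalate unless at least one retrieved article matches well enough."""
    if not retrieval_scores:
        return Decision("escalate", "no relevant article found")
    best = max(retrieval_scores)
    if best < min_score:
        return Decision("escalate", f"best match {best:.2f} is below {min_score:.2f}")
    return Decision("answer", "sufficient supporting context")


print(decide([0.41, 0.62]))  # weak matches: hand over to a human
print(decide([0.91]))        # strong match: safe to answer
```

The threshold itself should be tuned against the golden dataset rather than picked by feel.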
Answering from a knowledge base is a proof of concept. An agent that can escalate, be evaluated, and be inspected is a production system.