Our interviewee today is Anton, a software engineer with over five years of experience building product features centered around LLMs, retrieval systems, workflow automation, and AI agents. His expertise lies especially in the practical space between prototype and full production. He has worked on internal support assistants, knowledge agents, and chat-based tools that support real business workflows.
Today, we are asking him about the real-world lessons he has learned from building and integrating AI agents into software products.
“Can you share one real-world use case where an AI agent exceeded your expectations? What made the implementation successful in your experience?”
— One that genuinely surprised me was an internal support agent for operations teams. On paper, it sounded pretty standard: connect the agent to company documentation, let it answer process questions, and maybe trigger a few actions. I expected moderate value. What actually happened was that people started using it as a kind of second memory for processes that nobody had fully documented in one place. The reason it worked was not that the model was brilliant. Honestly, that is where people get carried away. The success came from a narrow scope, good retrieval, and hard limits on what the agent was allowed to do. It could answer policy questions, summarize relevant docs, and prepare tickets, but it could not directly make risky changes. And that mattered a lot. Among real AI agent use cases, I trust the boring ones more. If an agent reduces lookup time and keeps people from digging through five old systems, that is real value.
“What was one unexpected challenge during deployment that almost limited the agent’s effectiveness? How did you solve it?”
— One unexpected challenge popped up with timezone handling. The agent was coordinating task assignments across teams in different countries and regions (Japan, UK, California). It pulled schedule info from separate sources, each using its local clock in a slightly different way. For a couple of days, most “failures” looked like garbled task summaries or reminders sent at the wrong hour, but nobody could agree where the bug lived: backend, LLM, integration? Turned out, the datetime parsing logic assumed UTC for everything, even when the data carried a hidden local offset. We fixed it with a batch audit of all time conversions and forced explicit timezone tagging everywhere in the agent’s planning module.
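The fix Anton describes, explicit timezone tagging instead of an implicit UTC assumption, can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function name `to_utc`, the `source_tz` parameter, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_utc(raw: str, source_tz: tzinfo) -> datetime:
    """Parse an ISO 8601 timestamp and return an aware UTC datetime.

    Hypothetical helper: if the string carries its own offset, that offset
    is respected; if it is naive, it is interpreted in the zone declared for
    its source instead of being silently treated as UTC.
    """
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Naive value: tag it with the zone of the system it came from.
        parsed = parsed.replace(tzinfo=source_tz)
    return parsed.astimezone(timezone.utc)


# A fixed offset keeps the sketch dependency-free (assumed sample data).
JST = timezone(timedelta(hours=9))

naive_from_tokyo = to_utc("2024-03-01T09:00:00", JST)
already_tagged = to_utc("2024-03-01T09:00:00-08:00", JST)
```

In a real system the source zone would more likely be a `zoneinfo.ZoneInfo` such as `ZoneInfo("America/Los_Angeles")`, so that daylight saving changes are handled; a fixed offset is used here only to keep the example small.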
“What is one failure mode you've observed in AI agent behavior that you now actively guard against? How do you detect and prevent it?”
— The one I actively watch for is false confidence during multi-step tasks. The agent starts correctly, then drifts halfway through, but still presents the final output like everything is fine. That’s way worse than an obvious error. At least obvious errors make people more attentive. We usually detect it by logging all intermediate steps and confidence signals, then reviewing where outputs diverge from expected task paths. I don’t mean confidence in a philosophical sense. I mean practical checks. Did retrieval return weak matches, did a required tool fail, did the agent skip a validation step, did it answer without enough evidence?
To prevent it, we break tasks into smaller stages and validate after each stage. This is one of the few AI agent design best practices I feel strongly about. If the task matters, don’t let the agent improvise across a long chain with no checkpoints.
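One way to picture the "smaller stages with validation after each" idea is a runner that refuses to continue once a checkpoint fails. The sketch below is an assumption about how this could look, not a description of Anton's system; `Stage`, `CheckpointFailed`, `run_with_checkpoints`, and the two example stages are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]      # does the work for this stage
    check: Callable[[Any], bool]   # True if the stage output is acceptable


class CheckpointFailed(Exception):
    """Raised when a stage's output does not pass its validation check."""


def run_with_checkpoints(stages: list[Stage], payload: Any) -> tuple[Any, list[str]]:
    """Run stages in order, validating after each one and logging the result."""
    log: list[str] = []
    for stage in stages:
        payload = stage.run(payload)
        if not stage.check(payload):
            log.append(f"{stage.name}: failed")
            # Stop here instead of letting later stages build on bad output.
            raise CheckpointFailed(f"stage '{stage.name}' failed validation; log={log}")
        log.append(f"{stage.name}: ok")
    return payload, log


# Toy stages standing in for retrieval and drafting (illustrative only).
stages = [
    Stage(
        "retrieve",
        lambda q: {"query": q, "docs": ["refund-policy.md"]},
        lambda out: len(out["docs"]) > 0,
    ),
    Stage(
        "draft",
        lambda ctx: {**ctx, "answer": "See refund-policy.md"},
        lambda out: bool(out["answer"]),
    ),
]
result, log = run_with_checkpoints(stages, "How do refunds work?")
```

The point of the design is that a mid-task drift surfaces as an explicit failure with a log of intermediate steps, rather than as a confident-looking final answer.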
“What early warning signs usually indicate that an AI agent is about to behave unreliably or produce poor outputs?”
— A few patterns show up early. The first is vagueness. If answers become more polished but less anchored in actual source material, I get nervous. Another is the overuse of fallback language that sounds helpful but avoids taking some kind of responsibility. You see things like “typically,” “in many cases,” or generic summaries where a specific answer should exist. Tool behavior also tells you a lot. If the agent starts making extra retrieval calls or repeating steps, something is definitely off. It’s a bit like watching a person fumble for words in a meeting. You can sort of feel the shakiness before the obvious mistakes appear.
“How do you handle one specific scenario where your AI agent needs to escalate to a human? What criteria do you use to trigger this handoff?”
— A good example is a support agent handling account or billing disputes. If the user asks a factual question, the agent can respond. If the issue involves possible financial loss, account access risk, or emotional escalation, we move to human-in-the-loop mode pretty quickly. The handoff criteria are not fancy. We use a mix of detected intent, missing evidence, policy sensitivity, and repeated user dissatisfaction. If the agent cannot ground its answer in trusted data, or if the user has already challenged the answer twice, that is enough for escalation. I would rather hand off slightly early than let the system bluff its way into a terrible decision.
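The handoff rules above are simple enough to express as plain conditions. The following is a rough sketch under stated assumptions: the intent labels, the `TurnState` fields, and the function names are invented for illustration, and a real system would get these signals from its own intent detection and retrieval layers.

```python
from dataclasses import dataclass

# Hypothetical intent labels for topics that are sensitive regardless of confidence.
SENSITIVE_INTENTS = {"financial_loss", "account_access", "emotional_escalation"}


@dataclass
class TurnState:
    intent: str            # label from an upstream intent detector (assumed)
    grounded_sources: int  # trusted sources backing the current answer
    user_challenges: int   # times the user has disputed the answer so far


def escalation_reasons(state: TurnState) -> list[str]:
    """Return every reason this turn should go to a human (empty if none)."""
    reasons = []
    if state.intent in SENSITIVE_INTENTS:
        reasons.append("sensitive_intent")
    if state.grounded_sources == 0:
        reasons.append("no_trusted_evidence")
    if state.user_challenges >= 2:
        reasons.append("repeated_challenge")
    return reasons


def should_escalate(state: TurnState) -> bool:
    return bool(escalation_reasons(state))


factual = TurnState(intent="policy_question", grounded_sources=2, user_challenges=0)
dispute = TurnState(intent="financial_loss", grounded_sources=1, user_challenges=2)
```

Returning the reasons, not just a yes/no, makes the handoff easier to audit and gives the human operator context on why the conversation arrived.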
“What mistakes do teams commonly make when deciding when an AI agent should involve a human operator?”
— The most common mistake is treating escalation as failure when it’s not. It’s actually an integral part of the design. Good AI agent development includes knowing where automation should stop. Another mistake is using only low confidence as the trigger. That sounds reasonable, but it misses the broader context. Some high-confidence answers are still risky because the topic itself is sensitive. Teams also forget the user’s emotional state. If a user is angry, confused, or clearly dealing with something high-stakes, a technically correct answer from an agent may still be the wrong product decision.
“What is one technique you use to make your AI agent's decision-making process more transparent?”
— I like lightweight reasoning traces for the user, not full chain-of-thought dumps. That distinction matters. We show what sources were used, what tools were called, and why the answer took a certain path. Something simple like: “I used the refund policy from X, checked account status, and found no completed payment reversal.” That gives users enough context to judge the answer without exposing every internal step. From an architecture perspective, this also helps with debugging because the explanation layer is tied to actual system events rather than made-up post hoc text.
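A trace of this kind can be generated directly from recorded events, which is what keeps it honest. The sketch below assumes a very simple event log; the `Event` shape, the `summarize_trace` name, and the sample events are hypothetical and only illustrate the idea of building the explanation from what the system actually did.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str     # "source" or "tool" (assumed event types)
    name: str
    outcome: str


def summarize_trace(events: list[Event]) -> str:
    """Build a short user-facing explanation from recorded system events."""
    sources = [e.name for e in events if e.kind == "source"]
    tools = [f"{e.name} ({e.outcome})" for e in events if e.kind == "tool"]
    parts = []
    if sources:
        parts.append("Sources used: " + ", ".join(sources))
    if tools:
        parts.append("Checks run: " + ", ".join(tools))
    if not parts:
        return "No sources or tools were used."
    return ". ".join(parts) + "."


events = [
    Event("source", "refund policy", "read"),
    Event("tool", "account_status", "active"),
    Event("tool", "payment_reversal_lookup", "none found"),
]
summary = summarize_trace(events)
```

Because the text is derived from logged events, the same records serve both the user-facing explanation and debugging.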
“Why is this important for your users?”
— Because users need a reason to trust the output. Or reject it. Both are fair. If an agent says something surprising and gives no basis for it, people either overtrust it or ignore it completely. Neither outcome is good, and transparency helps users calibrate.
“Can you describe one method you've used to evaluate your AI agent's performance beyond simple accuracy metrics? What insights did this provide?”
— One method I rely on is task completion review with human raters. We take real sessions, define what “done well” actually means for that workflow, and score whether the agent moved the user toward resolution with acceptable effort, clarity, and safety. That sounds obvious, but it’s surprisingly different from plain accuracy. For AI agent evaluation, this gave us better insight than benchmark-style tests. We found cases where answers were factually correct but practically useless because they were too long or missing the next action. We also found the opposite: slightly imperfect wording, but very effective task support. That changed how we measured quality. I care less now about isolated answer correctness and more about whether the agent helped finish the job without causing issues.
“What metric or KPI turned out to be less useful than expected when measuring AI agent performance?”
— Containment rate. Or at least containment rate by itself. It looks nice at first. You know, fewer human handoffs, lower support cost. But it can hide a lot of bad behavior. An agent can keep users inside the system while annoying them, confusing them, or delaying real help. So yes, we still track it, but only next to satisfaction, correction rate, and resolution quality. Otherwise, it tells a very flattering story that may not be true.
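To make the point concrete, here is a small sketch of a scorecard that reports containment next to the other signals mentioned. The `Session` fields and the extra `contained_but_unresolved` figure are assumptions for illustration, and the sample sessions are made-up data, not results from any real product.

```python
from dataclasses import dataclass


@dataclass
class Session:
    handed_off: bool  # conversation was passed to a human
    resolved: bool    # the user's issue was actually resolved
    corrected: bool   # someone had to correct the agent
    satisfied: bool   # user reported satisfaction


def scorecard(sessions: list[Session]) -> dict[str, float]:
    """Report containment together with the metrics that keep it honest."""
    n = len(sessions)
    if n == 0:
        raise ValueError("no sessions to score")
    contained = [s for s in sessions if not s.handed_off]
    return {
        "containment_rate": len(contained) / n,
        "resolution_rate": sum(s.resolved for s in sessions) / n,
        "correction_rate": sum(s.corrected for s in sessions) / n,
        "satisfaction_rate": sum(s.satisfied for s in sessions) / n,
        # Share of all sessions that stayed with the agent yet went unresolved.
        "contained_but_unresolved": sum(1 for s in contained if not s.resolved) / n,
    }


# Illustrative sample data only.
sessions = [
    Session(handed_off=False, resolved=True, corrected=False, satisfied=True),
    Session(handed_off=False, resolved=False, corrected=True, satisfied=False),
    Session(handed_off=False, resolved=True, corrected=False, satisfied=True),
    Session(handed_off=True, resolved=True, corrected=False, satisfied=True),
]
report = scorecard(sessions)
```

In this toy data, containment looks healthy at 0.75, while a quarter of all sessions stayed inside the agent without being resolved, which is the kind of gap a single headline number hides.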
“What is one way you've designed your AI agent to learn from interactions over time?”
— We don’t let the model silently retrain itself from live conversations. I know that sounds exciting to some people, but I think it’s reckless for most product environments. What we do instead is collect reviewed feedback loops. We analyze failed sessions, recurring clarifications, rejected answers, and human edits, then use that data to improve prompts, retrieval ranking, routing logic, and test sets. In practice, that means the system learns at the product layer more than the model layer.
“What improvements have you noticed?”
— The biggest improvement has been sharper first responses. Over time, the agent became better at asking for the missing detail early instead of pushing ahead with a shaky guess. We also saw fewer repeated escalations for the same issue types because the system got better retrieval examples and cleaner decision rules.
“What is one architectural decision you made when building your AI agent that you would change if starting over?”
— On one of the projects, we made one agent do too much. It handled retrieval, planning, tool use, formatting, and user interaction in one messy flow. It was hard to debug and even harder to trust.
“What would you do differently and why?”
— If I were starting over, I would separate concerns much earlier. One component for retrieval, one for decision routing, one for action execution, and one for response composition.
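As a rough sketch of that separation, each concern can be a replaceable function wired together by a thin pipeline. Everything here is an assumption made for illustration: the `Pipeline` class, the four stub functions, and their behavior are placeholders, not the architecture from the project Anton mentions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    retrieve: Callable[[str], list[str]]            # find relevant documents
    route: Callable[[str, list[str]], str]          # decide which action to take
    execute: Callable[[str, str], str]              # carry out the chosen action
    compose: Callable[[str, list[str], str], str]   # write the user-facing reply

    def handle(self, query: str) -> str:
        docs = self.retrieve(query)
        action = self.route(query, docs)
        result = self.execute(action, query)
        return self.compose(query, docs, result)


# Stub components; each could be tested, logged, or swapped independently.
def retrieve(query: str) -> list[str]:
    return ["refund-policy.md"] if "refund" in query.lower() else []


def route(query: str, docs: list[str]) -> str:
    return "prepare_ticket" if docs else "ask_clarification"


def execute(action: str, query: str) -> str:
    return f"{action}: done"


def compose(query: str, docs: list[str], result: str) -> str:
    cited = ", ".join(docs) or "no sources"
    return f"{result} (based on {cited})"


pipeline = Pipeline(retrieve, route, execute, compose)
reply = pipeline.handle("Refund for order 12")
```

With the stages split this way, a bad answer can be traced to one component, such as weak retrieval or a wrong routing decision, instead of one tangled flow.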
“Can you share one example of how you balance autonomy and control in your AI agent design? What factors influenced your approach?”
— An agent that drafts operational actions but cannot finalize them without confirmation. It can gather data, suggest the next step, and prepare the change request, but a person approves the last move. That gave us speed without handing full control to a probabilistic system. And the factors were simple: financial risk, reversibility, and user trust.
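The "draft, but do not finalize" pattern can be sketched as an approval gate in front of the final action. The names `ChangeRequest`, `ApprovalRequired`, `draft_change`, `approve`, and `apply_change` are hypothetical, and the example assumes a single human reviewer for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a change is applied without human sign-off."""


@dataclass
class ChangeRequest:
    description: str
    approved_by: Optional[str] = None  # set only by a human reviewer


def draft_change(description: str) -> ChangeRequest:
    """What the agent is allowed to do: prepare the request, nothing more."""
    return ChangeRequest(description)


def approve(request: ChangeRequest, reviewer: str) -> ChangeRequest:
    """Record the human decision on the request."""
    request.approved_by = reviewer
    return request


def apply_change(request: ChangeRequest) -> str:
    """Final step: refuses to run unless a person has approved."""
    if request.approved_by is None:
        raise ApprovalRequired(f"'{request.description}' needs human approval")
    return f"applied '{request.description}' (approved by {request.approved_by})"


request = draft_change("update billing address for account 1042")
outcome = apply_change(approve(request, "ops-lead"))
```

The gate lives in the execution step rather than in the prompt, so the agent cannot talk its way past it.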
“In your experience, what tasks should businesses avoid fully automating with AI agents, at least for now?”
— Anything involving legal judgment, medical decisions, fraud accusations, or irreversible account changes. Also, conflict-heavy customer conversations. Especially those. Could AI agents assist there? Absolutely. Full automation, though? I would be very careful. Some tasks involve power, nuance, and consequences that land on real people. Letting autonomous systems handle that end-to-end feels premature to me. But maybe one day.
A few honest closing points: start narrower than you want to, log more than feels necessary, and treat human oversight as part of the design. Most production problems in AI agent development are not that dramatic. They are small misjudgments that accumulate to the point where users stop trusting the product.
The most effective systems are the ones with clear boundaries, useful retrieval, decent evaluation, and respectful handoffs when the situation calls for a person. That is less loud than the usual story around AI agents. It’s also what tends to survive contact with real users.