<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel><title>Voxli Blog</title><description>Field notes on testing conversational AI agents — multi-turn failures, tool-calling, hallucinations, and how to catch them before your customers do.</description><link>https://voxli.io/</link><language>en-us</language><item><title>A better model can make your agent worse</title><link>https://voxli.io/blog/a-better-model-can-make-your-agent-worse/</link><guid isPermaLink="true">https://voxli.io/blog/a-better-model-can-make-your-agent-worse/</guid><description>A stronger model with higher scores looks like a free upgrade, until the agent that worked last week starts getting things wrong, quietly. Here is what happened when we ran one agent on two frontier models and changed nothing else.</description><pubDate>Wed, 17 Jun 2026 06:41:00 GMT</pubDate><category>Agent Reliability</category><category>Conversational AI</category><category>AI Quality Assurance</category><category>AI Agent Testing</category><category>Model Behavior</category><category>AI Agents</category><author>Voxli</author></item><item><title>Upfront information dump</title><link>https://voxli.io/blog/upfront-information-dump/</link><guid isPermaLink="true">https://voxli.io/blog/upfront-information-dump/</guid><description>A customer opens your support agent with this:</description><pubDate>Tue, 26 May 2026 11:05:00 GMT</pubDate><category>Failure Modes</category><category>AI Agents</category><category>Conversational AI</category><author>Mahey Qadir</author></item><item><title>Mid-conversation tangent</title><link>https://voxli.io/blog/mid-conversation-tangent/</link><guid isPermaLink="true">https://voxli.io/blog/mid-conversation-tangent/</guid><description>A customer is halfway through a return flow with your agent. They&apos;ve shared the order number, the item and reason for the return. 
They then pause to ask: &quot;Wait, do you offer…</description><pubDate>Fri, 15 May 2026 14:15:00 GMT</pubDate><author>Voxli</author></item><item><title>The multi-turn failures that prompt evals can&apos;t see</title><link>https://voxli.io/blog/multi-turn-failures/</link><guid isPermaLink="true">https://voxli.io/blog/multi-turn-failures/</guid><description>Most agent failures we see in pilots don&apos;t show up on prompt evals.</description><pubDate>Mon, 27 Apr 2026 14:46:00 GMT</pubDate><category>Agent Reliability</category><category>AI Agents</category><category>AI Agent Testing</category><category>AI Quality Assurance</category><category>Conversational AI</category><category>Support Agent</category><author>Voxli</author></item><item><title>The 10-minute test that stops your agent from canceling real orders</title><link>https://voxli.io/blog/the-10-minute-test-that-stops-your-agent-from-canceling-real-orders/</link><guid isPermaLink="true">https://voxli.io/blog/the-10-minute-test-that-stops-your-agent-from-canceling-real-orders/</guid><description>Last week a failed tool call caused GPT-5.4-mini to cancel a real order simply because a customer asked a question involving cancellation. Here&apos;s a quick test that catches it.</description><pubDate>Tue, 21 Apr 2026 09:34:00 GMT</pubDate><author>Voxli</author></item><item><title>Expertise.ai teams up with Voxli to solve the &quot;absolute insanity&quot; of their AI sales Agent testing workflow</title><link>https://voxli.io/blog/expertise-ai-teams-up-with-voxli/</link><guid isPermaLink="true">https://voxli.io/blog/expertise-ai-teams-up-with-voxli/</guid><description>Expertise.ai is a known disruptor in the AI space, building AI sales agents that guide prospects through personalized flows. 
Here&apos;s how Voxli untangled their testing workflow.</description><pubDate>Thu, 16 Apr 2026 12:06:00 GMT</pubDate><category>Case Study</category><category>Customer Story</category><author>Mahey Qadir</author></item><item><title>The failed Tool Call when Simulating a Customer Conversation Across Three LLMs</title><link>https://voxli.io/blog/ai-agents-tool-handling/</link><guid isPermaLink="true">https://voxli.io/blog/ai-agents-tool-handling/</guid><description>To assess how AI agents handle tool calls, we recently ran the same multi-turn conversation across the three tiers of OpenAI&apos;s GPT-5.4: standard, mini, and nano.</description><pubDate>Tue, 14 Apr 2026 08:48:00 GMT</pubDate><category>AI Agent Testing</category><category>AI Agents</category><author>Mahey Qadir</author></item><item><title>Testing for Speculation using Voxli</title><link>https://voxli.io/blog/testing-for-speculation-using-voxli/</link><guid isPermaLink="true">https://voxli.io/blog/testing-for-speculation-using-voxli/</guid><description>In our last post we covered the risks of agent speculation. Today we look at how to set up Voxli to catch that speculation using a feature called Hallucination detection.</description><pubDate>Thu, 02 Apr 2026 11:47:00 GMT</pubDate><category>AI Agent Testing</category><category>How-to-guide</category><author>Mahey Qadir</author></item><item><title>The Risks of Agent Speculation</title><link>https://voxli.io/blog/risks-of-agent-speculation/</link><guid isPermaLink="true">https://voxli.io/blog/risks-of-agent-speculation/</guid><description>It&apos;s no surprise that hallucinations are a common, well-known failure in agentic AI testing. 
The agent starts to overpromise, begins to fabricate answers and even claims that it…</description><pubDate>Fri, 27 Mar 2026 15:26:00 GMT</pubDate><category>AI Agents</category><category>AI Agent Testing</category><category>LLM Testing</category><category>Model Behavior</category><category>Reasoning Models</category><category>Agent Reliability</category><category>Conversational AI</category><category>AI Quality Assurance</category><category>Support Agent</category><author>Voxli</author></item></channel></rss>