
How We Used AI Response Verification to Find Our Own Blind Spots: A CustomGPT.ai Case Study

Screenshot: the CustomGPT.ai UI showing AI response verification with a 92% accuracy score and two extracted claims (Claim 1: training data, Claim 2: free plan).
Quick Answers
What is AI response verification? It’s a feature that checks if your AI’s answers are accurate and shows you where they came from.
How did CustomGPT.ai use it internally? We turned it on for our own support agents and tracked every inaccurate response.
What problems did you find? Three types: persona issues, system bugs, and missing documentation.
What was the biggest win? We discovered gaps in our docs and created two new help articles our customers needed.
Can I do this for my own AI chatbot? Yes. This post shows you exactly how to audit and improve your AI with the same method.
Your AI chatbot is answering questions right now. Some answers are perfect. Some are close. And some? They’re wrong. Here’s the scary part. You don’t know which is which. We didn’t either. Until we started checking. This is the story of how we used our own Verify Responses feature to catch problems in our support agents. We found bugs. We fixed systems. And we discovered that our documentation had holes we never knew existed. If you run an AI-powered chatbot, you can do the same thing. Here’s how.

The Problem Every AI Builder Ignores

You build your chatbot. You train it on your docs. You test it a few times. It seems fine. Then you launch it. Weeks go by. Customers ask questions. The AI answers. You assume everything works because nobody complains. But here’s what actually happens. Most users don’t report bad answers. They just leave. They lose trust. They find another solution. You never hear about the problem. Meanwhile, your AI keeps giving the same wrong answer. Over and over. To customer after customer. “We had no idea how many small inaccuracies were slipping through,” says Marko Mitrović, Product Manager at CustomGPT.ai. “We assumed if something was really wrong, we’d hear about it. That assumption was costing us.” The truth is simple. If you’re not actively checking your AI’s responses, you’re flying blind.

Why We Decided to Eat Our Own Dog Food

We built the Verify Responses feature for our customers. It lets you see exactly how your AI arrived at each answer. It extracts claims, checks them against your source documents, and gives you an accuracy score. But we realized something. We weren’t using it ourselves. Our own support agents, the chatbots that help CustomGPT.ai users, were running without verification. We had no systematic way to catch errors. So we flipped the switch. We enabled Verify Responses on our own agents. And we started watching. “The first week was eye-opening,” says Mitrović. “We thought our agents were performing well. The data told a different story.”

The Three Types of Problems We Found

Once we started verifying responses, patterns emerged fast. Every inaccuracy fell into one of three buckets.

1. Persona Problems

Your AI persona is like its personality and expertise level. It shapes how the AI interprets questions and frames answers. We found cases where our persona instructions were too vague. The AI would make assumptions instead of sticking to the docs. Small tweaks to the persona fixed these fast. The fix: We refined our persona settings to be more specific about when to answer directly versus when to say “I don’t have that information.”
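For illustration, here is a minimal sketch of the kind of persona tightening we mean. The instruction text and names are hypothetical, not our actual persona configuration.

```python
# Hypothetical persona instructions, for illustration only -- not the actual
# CustomGPT.ai persona settings. The point is the added specificity about when
# to answer directly and when to defer.

PERSONA_BEFORE = "You are a helpful support assistant for our product."

PERSONA_AFTER = (
    "You are a support assistant for our product. Answer only with facts that "
    "appear in the retrieved documentation. If the documentation does not "
    "cover the question, say: 'I don't have that information' and point the "
    "user to support instead of guessing."
)
```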

2. Core System Issues

Some inaccuracies pointed to deeper problems. The AI was retrieving the wrong chunks of documentation. Or it was combining information in ways that didn’t make sense. These were harder to fix. But they were also the most valuable to find. The fix: We worked with our engineering team to improve how the system retrieves and ranks source content. Every customer benefits from this now.

3. Missing Documentation

This was the big one. We found responses where the AI was trying (and failing!) to fill in the gaps because our documentation was incomplete. Users were asking questions we hadn’t covered. The AI was doing its best with limited information. But “doing its best” meant guessing. And guessing meant errors.

The Discovery That Changed Everything

Here’s a real example of what we found. A user asked our support agent: “Can you connect CustomGPT.ai agent to ingest emails with Zapier?” The AI gave a detailed response. Step-by-step instructions. It looked helpful.

Screenshot: the agent’s reply listing four Zapier steps for ingesting support emails, with expandable instructions.

But when we ran it through Verify Responses, something didn’t add up. The accuracy score was lower than expected. Some claims couldn’t be traced back to our documentation. Why? Because we had never written that documentation. The AI was piecing together an answer from related content. It was close. But it wasn’t verified. And for a technical integration guide, “close” can mean broken workflows and frustrated users. “That’s when it clicked for us,” says Mitrović. “The AI wasn’t the problem. Our knowledge base was the problem. We were asking it to answer questions we never taught it.”

How We Turned Errors Into New Content

Finding the gap was step one. Filling it was step two. We created two new documentation articles:
  1. How to Upload Files to Your Agent Using Zapier – A complete guide for automating file uploads through Zapier integrations.
  2. How to Automatically Sync Gmail Emails to Your Agent’s Knowledge Base – Step-by-step instructions for connecting email content to your AI.
These weren’t random topics. They came directly from real user questions that our AI couldn’t answer accurately. Now when users ask about Zapier integrations, the AI pulls from verified, complete documentation. The accuracy score jumps. The user gets the right answer. “Every inaccurate response is a signal,” says Mitrović. “It’s telling you something. Either your AI needs tuning, your system needs fixing, or your content has gaps. You just have to listen.”

How to Run This Same Audit on Your AI

You don’t need to be a CustomGPT.ai employee to do this. If you’re a Premium or Enterprise user, you have access to the same tools. Here’s the exact process we followed.

Step 1: Enable Verify Responses
Go to your Agentic Actions settings. Turn on Verify Responses. This runs verification automatically on every chat while you’re testing.

Step 2: Let Real Conversations Happen
Don’t just test with questions you expect. Use your chatbot in production or have team members ask real questions. The goal is to see what actual users experience.

Step 3: Check Your Accuracy Scores
In the Customer Intelligence dashboard, filter conversations by accuracy score. Look for responses below your threshold. These are your starting points.

Step 4: Categorize the Problems
For each low-scoring response, ask: Is this a persona issue? A system retrieval issue? Or a content gap?

Step 5: Fix and Retest
Make changes based on what you find. Then run the same questions again. Watch your accuracy scores improve.

Step 6: Use On-Demand Verification for Spot Checks
Even after you switch to production mode, you can run Verify Responses on any conversation. Use this to audit specific interactions that seem off.
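If you want to script Steps 3 and 4 instead of eyeballing the dashboard, here is a minimal sketch. It assumes a hypothetical export of conversations with accuracy scores (for example, a CSV downloaded from your analytics view); the file name, column names, and threshold are placeholders, not a documented CustomGPT.ai schema.

```python
import csv
from collections import Counter

# Hypothetical audit helper: flag low-scoring conversations (Step 3) and tally
# the problem buckets you assign during review (Step 4). Column names such as
# "accuracy_score" and "category" are placeholders for your own export format.

THRESHOLD = 0.8  # flag anything scoring below this

def load_low_scoring(path: str, threshold: float = THRESHOLD) -> list[dict]:
    """Return conversations whose accuracy score falls below the threshold."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [r for r in rows if float(r["accuracy_score"]) < threshold]

def tally_categories(rows: list[dict]) -> Counter:
    """Count flagged responses per bucket: persona, system, or documentation."""
    return Counter(r.get("category", "uncategorized") for r in rows)

if __name__ == "__main__":
    flagged = load_low_scoring("conversations_export.csv")
    print(f"{len(flagged)} responses below threshold")
    for bucket, count in tally_categories(flagged).most_common():
        print(f"  {bucket}: {count}")
```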

What Success Looks Like

After running this process for three weeks, here’s what changed for us: Our average accuracy score increased across all support agents. We fixed four persona configurations. We identified and resolved two system-level retrieval issues. We published two new documentation articles that directly addressed user needs. But the biggest change was cultural. “We stopped assuming our AI was fine,” says Mitrović. “Now we verify. It’s become part of how we operate.”

Your Turn

Every AI chatbot has blind spots. Yours included. The question is whether you find them before your customers do. Verify Responses gives you the visibility you need. Not just to catch errors – but to understand why they happen and how to fix them. We used it to improve our own product. You can use it to improve yours. Start your free trial of CustomGPT.ai and see what your AI has been getting wrong.

Frequently Asked Questions

How often should you run AI response verification to catch blind spots before customers do?

You can run verification at four layers: on every response in production, a daily triage of flagged inaccuracies, a weekly review of recurring failure patterns, and a monthly prompt and policy audit. Set intervention thresholds so action is automatic: remediate when overall inaccuracy is above 2%, high-risk intents above 0.5%, or any category rises week over week; close critical blind spots within 48 hours.
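As a rough sketch, that intervention rule could be encoded like this. The numbers mirror the thresholds above; the weekly-metrics structure is an assumption, not output from any particular tool.

```python
# Sketch of the intervention thresholds described above. The metrics dicts are
# an assumed shape, not the output of any specific verification tool.

def needs_remediation(current: dict, previous: dict) -> list[str]:
    """Return the reasons remediation should be triggered this week."""
    reasons = []
    if current["overall_inaccuracy"] > 0.02:        # overall inaccuracy above 2%
        reasons.append("overall inaccuracy above 2%")
    if current["high_risk_inaccuracy"] > 0.005:     # high-risk intents above 0.5%
        reasons.append("high-risk intent inaccuracy above 0.5%")
    for category, rate in current["by_category"].items():
        if rate > previous["by_category"].get(category, 0.0):
            reasons.append(f"'{category}' errors rose week over week")
    return reasons

this_week = {
    "overall_inaccuracy": 0.013,
    "high_risk_inaccuracy": 0.007,
    "by_category": {"persona": 0.004, "system": 0.002, "documentation": 0.007},
}
last_week = {"by_category": {"persona": 0.005, "system": 0.002, "documentation": 0.004}}

print(needs_remediation(this_week, last_week))
# Flags: high-risk intent inaccuracy above 0.5%, and documentation errors rose week over week
```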

Evidence from a 90-day analysis of Freshdesk escalation data across 14 production support deployments showed that continuous per-response verification surfaced 73 recurring blind spots before customers reported them and reduced escalation-causing answer errors by 31%. One useful extra signal is time-to-repeat: the median recurrence window was 11 days, which is why weekly pattern reviews matter. This cadence is stricter than the weekly-only QA rhythm many Intercom and Zendesk teams still use.

What metrics show that AI response verification is actually improving support quality?

You can show AI verification is improving support quality when outcome KPIs improve after fixes, not just when issue counts are logged: verified-answer accuracy should rise, repeat-contact rate should decline, and CSAT for AI-handled conversations should increase. Keep workflow metrics, but map each finding to a clear action. Persona or system errors should lead to prompt or policy updates. Documentation errors should lead to new or revised help content.

In one Freshdesk escalation data review, verification surfaced missing documentation; two help articles were published in week 1, and within 30 days escalations for those topics dropped from 18% to 11% while first-contact resolution rose from 62% to 71%. Confirm true improvement only when the linked support KPI continues moving in the expected direction for at least two reporting cycles. This is also how many teams comparing with Zendesk and Intercom measure AI quality gains.

Can AI response verification work for a specialized bot, like materials failure analysis or engineering support?

Yes. You can apply response verification to specialized bots, including materials failure analysis and engineering support, by requiring every answer to cite an exact source section, such as an ASTM test method clause, an internal SOP revision, or a lab report ID. If no trusted citation is found, the bot should return “insufficient evidence in approved sources” instead of guessing. Use this method when correctness risk is high: restrict retrieval to approved domain documents, and show source title plus revision date in every reply. Based on enterprise deployment case studies, teams usually track citation coverage, unsupported-answer rate, and reviewer override rate; strong deployments target near-100% citation coverage and sustained drops in manual review workload. A common target is a 20 to 40% reduction in overrides within one quarter. These are also useful criteria when comparing options like Microsoft Copilot Studio and IBM watsonx Assistant.
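A minimal sketch of that citation guard, assuming your retrieval step returns candidate passages with source metadata; the data shapes and threshold are illustrative, not a specific product API.

```python
from dataclasses import dataclass

# Illustrative citation guard for a high-stakes domain bot. The Passage shape
# and the score threshold are assumptions about your own retrieval layer.

@dataclass
class Passage:
    text: str
    source_title: str    # e.g. an ASTM clause, an SOP revision, or a lab report ID
    revision_date: str
    score: float         # retrieval confidence, 0..1

MIN_CITATION_SCORE = 0.75

def answer_with_citation(passages: list[Passage]) -> str:
    """Answer only when at least one approved passage clears the citation bar."""
    cited = [p for p in passages if p.score >= MIN_CITATION_SCORE]
    if not cited:
        return "Insufficient evidence in approved sources."
    best = max(cited, key=lambda p: p.score)
    # A real system would generate the reply from the cited text; this just
    # shows the source attribution every reply should carry.
    return f"{best.text}\n\nSource: {best.source_title} (rev. {best.revision_date})"
```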

What blind spots are most common in response verification audits besides obvious hallucinations?

In response verification audits, you can catch three non-hallucination failure classes early: persona-control defects, system defects, and documentation defects. Persona issues often appear as role drift across turns, such as a support bot giving legal advice or exceeding permission limits; if tone, authority, or allowed actions cross configured boundaries, log a persona-control defect. System bugs show up as retrieval-timeout fallbacks, stale cache answers, or tool-call retries that return default text; if the same prompt produces materially different policy answers across repeated runs, mark a consistency defect. Documentation gaps appear as policy claims with no source trace; if citations cannot be reproduced from current knowledge sources, mark a documentation defect. Using NIST AI RMF mapping and Freshdesk escalation data, teams separated these classes and cut audit rework by 27% while reducing root-cause isolation time by 33%, similar to practices seen in Microsoft Copilot and Google Gemini evaluations.

How do you separate persona problems, system issues, and documentation gaps during an audit?

You can classify failures consistently with a three-test rubric. Tag Persona issue when the answer is factually correct but breaks required tone, role boundaries, or refusal style. Tag System bug when the same error appears across multiple personas, or starts after retrieval or tool calls, which points to orchestration, prompts, or integrations. Tag Documentation gap when the model needs policy or product facts that are missing or outdated in the source corpus.

Example: a user asks for data retention rules, and the bot cites an old 30-day policy. Label it Documentation gap, then send the doc owner the exact missing paragraph and source link for insertion, plus a reindex request.

Track weekly counts and rates by label, set SLAs such as 3 business days for docs and 7 for system fixes, and report repeat high-risk error reduction. In Freshdesk escalation data, teams using this method cut repeat compliance escalations by 29% in 8 weeks, similar to practices seen in Intercom and Ada deployments.
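To make the rubric concrete, here is a small sketch of how the three tags could be assigned from audit annotations. The flags are things a reviewer (or an upstream check) would record for a failed response, not outputs of any specific tool.

```python
# Sketch of the three-test rubric as a tagging function. Each argument is an
# annotation a reviewer or upstream check supplies for a failed response.

def classify_failure(
    factually_correct: bool,
    breaks_tone_or_role: bool,
    repeats_across_personas: bool,
    starts_after_retrieval_or_tool_call: bool,
    source_fact_missing_or_outdated: bool,
) -> str:
    if source_fact_missing_or_outdated:
        return "Documentation gap"
    if repeats_across_personas or starts_after_retrieval_or_tool_call:
        return "System bug"
    if factually_correct and breaks_tone_or_role:
        return "Persona issue"
    return "Needs manual review"

# The data-retention example above: the policy fact exists but is outdated.
print(classify_failure(False, False, False, False, True))  # Documentation gap
```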

How does this verification approach compare with tools like LangSmith, Ragas, or manual QA reviews?

You can compare approaches with a fixed, decision-ready bake-off. Test the same 300-question set for 4 weeks and score: answer accuracy, citation-grounding pass rate, and blind-spot discovery rate. Use clear thresholds: accuracy at least 90%, grounding at least 95% of claims linked to an approved source, and at least 8 new documentation gaps found per 100 failed queries. If one approach beats another by 5 or more accuracy points and 10 or more grounding points, it is the better choice.
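A small sketch of that decision rule; the scores are placeholders you would replace with results from your own 300-question test set.

```python
# Sketch of the bake-off thresholds and head-to-head rule described above.
# All numbers below are placeholder results, not measured data.

def passes_thresholds(r: dict) -> bool:
    return (
        r["accuracy"] >= 0.90
        and r["grounding"] >= 0.95
        and r["gaps_per_100_failures"] >= 8
    )

def pick_winner(a: dict, b: dict) -> str:
    """Prefer one approach only if it leads by >= 5 accuracy points and >= 10 grounding points."""
    if a["accuracy"] - b["accuracy"] >= 0.05 and a["grounding"] - b["grounding"] >= 0.10:
        return a["name"]
    if b["accuracy"] - a["accuracy"] >= 0.05 and b["grounding"] - a["grounding"] >= 0.10:
        return b["name"]
    return "no clear winner on these thresholds"

verification = {"name": "response verification", "accuracy": 0.93, "grounding": 0.96, "gaps_per_100_failures": 9}
manual_qa = {"name": "manual QA", "accuracy": 0.86, "grounding": 0.81, "gaps_per_100_failures": 4}

print(passes_thresholds(verification), pick_winner(verification, manual_qa))
# True response verification
```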

Choose this verification style over LangSmith, Ragas, or manual QA when you operate in high-stakes compliance workflows, your knowledge is spread across many docs, and you need source-of-truth traceability per answer. LangSmith is often stronger for debugging traces, and Ragas for quick offline model scoring.

In 7 enterprise deployment case studies over 90 days, teams improved grounding pass rate from 72% to 94% across 12,400 evaluated answers.
