Nearly 1,000 questions across diverse datasets used to measure answer reliability and response time
BOSTON, July 18, 2024 – Just months after demonstrating answer quality superiority over OpenAI, Google, Amazon, Cohere, and others, CustomGPT.ai again excelled in RAG benchmark analysis of anti-hallucination performance comparing its enterprise AI platform with OpenAI’s Assistant API V2. In testing involving 945 questions across nine diverse data sets, CustomGPT.ai outperformed OpenAI by achieving a 10 percent lower hallucination rate, 13 percent higher accuracy rate, and 34 percent faster average response time.
“In today’s AI race, companies must adopt an ‘anti-hallucination first’ focus,” said CustomGPT.ai founder and CEO Alden Do Rosario. “We founded our company on this premise, and we’re thrilled new research further validates our technology, especially for the 6,000-plus customers we now serve.”
As organizations bring AI into their operations, they take on responsibility for the information it generates. "To reduce risk," Do Rosario added, "entities should adequately vet foundational AI technology and use solutions that are proven."
As skeptics in both B2C and B2B markets question AI's reliability, precision, and performance, Do Rosario believes these findings will resonate especially in the legal, finance, healthcare, and education sectors, where the business impact of answer accuracy is paramount.
Hallucinations — instances where AI generates information not grounded in reality or in the provided context — can lead to misinformed decision-making, compliance issues, safety risks, and erosion of trust in AI. The research highlights nuanced differences in context handling and reliability: CustomGPT.ai delivered more comprehensive and accurate responses, while OpenAI's answers often lacked detail or missed the mark entirely.
Validated by Tonic.ai, a pioneer in data mimicking and de-identification, the research supports the use of Retrieval-Augmented Generation (RAG) to help mitigate AI hallucinations and support delivery of more precise and reliable information.
This benchmark went far beyond recent research by using 945 questions rather than 55 and by testing against nine datasets spanning topics from public health to literature, rather than a single dataset. It also applied an "answer consistency binary" metric, under which any deviation from the expected answer counted as a failed response.
Do Rosario said this research significantly ups the ante for statistical significance, data diversity, and scoring rigor.
“Gone are the days of organizations needing to settle for chatbots that generate inaccurate responses, especially from short-sighted, underperforming, or overpriced AI vendors,” he stated. “The future is wide open for gen AI to responsibly deliver comprehensive and contextually accurate information in order to truly help organizations advance decision-making capabilities, improve operational efficiency and increase revenues.”
About CustomGPT.ai
CustomGPT.ai offers a novel, business-grade, privacy-first, no-code generative AI platform. The technology makes it quick, easy, and affordable for anyone — regardless of technical expertise — to ingest their own content and data, to build custom bots and other GPT agents, and to deploy these solutions with confidence. CustomGPT.ai leverages advanced large language models (including OpenAI’s GPT-4) to offer the industry’s best accuracy and anti-hallucination protection. Nearly 6,000 entities rely on CustomGPT.ai to deliver SOC2-compliant solutions that improve operational efficiency, enhance customer engagement, and increase sales – including Adobe, the Massachusetts Institute of Technology, the Dominican Republic’s GPTLegal, and the UK’s DivorceOnline. REST APIs and SDKs are available for developers, ISVs, digital agencies, and resellers. Visit enterprise overview or contact hello@customgpt.ai.


Frequently Asked Questions
How do you test hallucination in a RAG system?
A strong RAG hallucination test checks whether answers stay grounded in approved source material. In the July 2024 benchmark, 945 questions across nine datasets were used, and evaluators applied an “answer consistency binary” metric, so any deviation from the expected answer counted as a failure. If you want a strict test, that pass-fail approach is more reliable than giving partial credit to answers that sound polished but add unsupported details.
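To make the pass-fail idea concrete, here is a minimal sketch of a binary answer-consistency check. This is an illustrative harness, not the benchmark's actual implementation: the function names, the whitespace/case normalization rule, and the sample data are all assumptions.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as deviations."""
    return " ".join(text.lower().split())

def is_consistent(model_answer: str, expected_answer: str) -> bool:
    """Binary metric: the answer passes only if it matches the expected
    answer after normalization. Any deviation, including extra
    unsupported detail, counts as a failure (no partial credit)."""
    return normalize(model_answer) == normalize(expected_answer)

def score(results: list[tuple[str, str]]) -> float:
    """Fraction of (model_answer, expected_answer) pairs that pass."""
    passes = sum(is_consistent(got, want) for got, want in results)
    return passes / len(results)

# Hypothetical example: an exact match passes; an answer that adds
# unsupported detail fails under the binary rule.
results = [
    ("Paris", "paris"),
    ("Paris, the capital since 508 AD", "Paris"),
]
print(score(results))  # 0.5
```

The strictness is the point: a polished answer that embellishes beyond the expected response scores exactly the same as an outright wrong one, which is what makes the metric a hallucination test rather than a fluency test.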
What were the anti-hallucination benchmark results versus OpenAI?
The benchmark, validated by Tonic.ai, reported a 10 percent lower hallucination rate, 13 percent higher accuracy rate, and 34 percent faster average response time than OpenAI’s Assistant API V2. If you are comparing RAG systems, those three numbers matter together because they measure reliability, correctness, and speed in the same test.
Why does binary scoring matter in an anti-hallucination benchmark?
Binary scoring matters because any deviation from the expected answer is treated as a failed response, which raises the bar for systems that need dependable outputs. That standard reflects how careful teams evaluate real deployments, not just demo performance. Brendan McSheffrey of The Kendall Project said, “We love CustomGPT.ai. It’s a fantastic Chat GPT tool kit that has allowed us to create a ‘lab’ for testing AI models. The results? High accuracy and efficiency leave people asking, ‘How did you do it?’ We’ve tested over 30 models with hundreds of iterations using CustomGPT.ai.”
Why is a larger, more diverse RAG benchmark more credible?
A benchmark is more credible when it tests many questions across varied subject matter instead of a single narrow dataset. Here, 945 questions across nine datasets were used, and the research was validated by Tonic.ai. That combination makes the results more persuasive than small studies that rely on limited question counts or only one dataset.
Can reducing hallucinations also improve response time?
Yes. In this benchmark, the system with the lower hallucination rate was also 34 percent faster on average than OpenAI’s Assistant API V2, so stronger grounding did not require a speed tradeoff in this test. That matters because fast answers are only useful when they are also reliable. Evan Weber described the practical upside this way: “I just discovered CustomGPT, and I am absolutely blown away by its capabilities and affordability! This powerful platform allows you to create custom GPT-4 chatbots using your own content, transforming customer service, engagement, and operational efficiency.”
What should regulated industries look for in an anti-hallucination benchmark?
If you work in legal, finance, healthcare, or education, look for four things: low hallucination rates, strict scoring, diverse test data, and independent validation. For deployment readiness, also check whether a vendor is SOC 2 Type 2 certified, GDPR compliant, and states that customer data is not used for model training. Dan Mowinski summed up the buyer mindset well: “The tool I recommended was something I learned through 100 school and used at my job about two and a half years ago. It was CustomGPT.ai! That’s experience. It’s not just knowing what’s new. It’s remembering what works.”
Related Resources
These reads add useful context on retrieval quality and the broader generative AI landscape.
- The Rise of RAG — An overview of why retrieval-augmented generation has become central to building more accurate, trustworthy AI systems.
- Best Generative AI Tools — A practical look at leading generative AI platforms and how they compare across common business use cases.