CustomGPT.ai Blog

CustomGPT.ai Once Again Outperforms OpenAI for Accuracy

July 22, 2024


CustomGPT.ai accuracy comparison context shown via AI RESEARCH BENCHMARK STUDIES cover with neon charts and bars

CustomGPT.ai has outperformed OpenAI’s Assistant API V2 with greater accuracy, fewer hallucinations, and faster average response times in a new, more thorough evaluation.

The latest study compares CustomGPT.ai and OpenAI’s Assistant API V2 performance across 945 questions from diverse datasets.

The study was independently validated by benchmark experts Tonic.ai, who performed a prior assessment a few months ago in which CustomGPT.ai outperformed OpenAI, Google, Amazon, and Cohere.

Limiting or eliminating AI hallucinations, where AI generates information not grounded in reality or in the provided context, must be a priority for organizations adopting AI technologies. It’s also the premise on which CustomGPT.ai was founded. We’re thrilled to see the results of this new comparison, which underline both CustomGPT.ai’s potential for serving companies that require high-precision AI solutions and the quality of the experience received by our 6,000+ existing customers.

New Anti-Hallucination Benchmark Cements CustomGPT.ai’s Potential

The new study, performed by Atman Academy and validated by Tonic.ai, set a new standard by using nearly 1,000 questions (945 in total) from nine very different datasets. In addition to an expanded question set and more diverse text, the study used a much stricter evaluation metric that passes a response only when it is entirely consistent with the provided context.

The comparison also pits CustomGPT.ai against the newest version of OpenAI’s business chatbot offering, Assistant API V2, which has advanced file search capabilities.

CustomGPT.ai’s performance was remarkable, demonstrating:

13% Higher Accuracy Rate: Compared to Assistant API V2. This means fewer inaccurate responses and more reliable information delivered by CustomGPT.ai chatbots and AI assistants.

10% Lower Hallucination Rate: CustomGPT.ai’s advanced algorithms more effectively filter out irrelevant information, reducing the likelihood of hallucinations, where AI delivers false or ungrounded responses.

34% Faster Average Response Time: These more accurate responses are delivered much faster, demonstrating improved efficiency without sacrificing the quality of AI’s answers.

Adoption of an Anti-Hallucination First Focus is Vital

The deployment of AI solutions comes with a great responsibility to ensure the quality and accuracy of the responses delivered. CustomGPT.ai CEO Alden Do Rosario recommends:

“To reduce risk, entities should adequately vet foundational AI technology and use solutions that are proven.”

Do Rosario believes this latest study’s findings will especially resonate in industries where accuracy is paramount, such as the legal sector, finance, healthcare, and education. He says:

“In today’s AI race, companies must adopt an ‘anti-hallucination first’ focus.”

AI skeptics rightly challenge AI’s reliability, precision, and performance. For organizations that cannot mitigate them, AI hallucinations can lead to misinformed decision-making, compliance issues, safety risks, erosion of trust in AI, severe reputational damage, and even legal exposure.

“Gone are the days of organizations needing to settle for chatbots that generate inaccurate responses, especially from short-sighted, underperforming, or overpriced AI vendors,” adds Do Rosario.

“The future is wide open for gen AI to responsibly deliver comprehensive and contextually accurate information in order to truly help organizations advance decision-making capabilities, improve operational efficiency, and increase revenues.”

Robust Evaluation of Retrieval-Augmented Generation (RAG) in Mitigating AI Errors

The study assesses the performance of Retrieval-Augmented Generation (RAG) technology, which is used by both CustomGPT.ai and OpenAI. RAG enhances the capabilities of generative AI and large language models (LLMs). LLMs are a foundation for natural language processing and enable text generation and question-answering, but they depend on large training data sets that are often outdated, and they can deliver inaccurate or inconsistent responses.

RAG leverages LLMs but also external knowledge sources. It retrieves answers from information provided explicitly by a company or organization before using the LLM to enrich the response, producing accurate, contextually relevant answers grounded in real-world knowledge.

CustomGPT.ai diagram maps Retrieval Augmented Generation: query, search, relevant docs, and Pre-trained LLM response.
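The query, search, relevant docs, and LLM response flow described above can be sketched in a few lines of Python. This is a minimal illustration only, not CustomGPT.ai’s or OpenAI’s implementation: the word-overlap ranking is a toy stand-in for real semantic search, and the names retrieve, build_prompt, answer, and the llm callable are all hypothetical.

```python
from typing import Callable, List


def tokenize(text: str) -> set:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = {word.strip(".,?!:;\"'()").lower() for word in text.split()}
    return words - {""}


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Rank documents by word overlap with the query (toy stand-in for vector search)."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Keep only documents that share at least one word with the query.
    return [doc for score, doc in ranked[:top_k] if score > 0]


def build_prompt(query: str, context: List[str]) -> str:
    """Ask the model to answer from the retrieved context only."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n"
        f"Context:\n{joined}\n"
        f"Question: {query}"
    )


def answer(query: str, documents: List[str], llm: Callable[[str], str]) -> str:
    """Retrieve relevant passages first, then hand the grounded prompt to the LLM."""
    context = retrieve(query, documents)
    return llm(build_prompt(query, context))
```

Passing the model in as a plain callable keeps the sketch independent of any vendor SDK; in practice llm would wrap whichever model API an organization uses.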

Study Methodology

The objective of this latest study was to benchmark CustomGPT.ai against OpenAI’s Assistant API V2 in RAG testing, specifically assessing how well each reduces hallucinations. The goal was to position CustomGPT.ai as a superior AI solution for industries where precision is critical, such as legal, medical, and financial services.

The assessment used a systematic methodology to ensure accuracy and reliable results, a key consideration in understanding the business impact of AI accuracy. GPT-4o was the assessor. In contrast to the previous study, which used just 55 questions, the new study used 945 questions on various topics. Tests were conducted in a controlled environment using the same hardware and software configurations.

The full technical analysis of the CustomGPT.ai anti-hallucination answer consistency benchmark explains the complete methodology and approach.

The “Answer Consistency Binary” Metric

The “Answer Consistency Binary” metric was used to evaluate CustomGPT.ai and OpenAI Assistant performance. In essence, the metric leaves no room for ambiguity in the responses. If the response delivered to a question is entirely consistent with the provided context, it passes (scored as 1). Any inconsistency in the response is an immediate fail (scored as 0).

Tonic.ai, developers of Tonic Validate and industry leaders in RAG metrics and independent validation methodologies, validated the study approach, which closely follows Tonic.ai’s RAG evaluation approach to ensure a robust and replicable process.

Tonic.ai defines the Answer Consistency Binary Metric used in the evaluation:

CustomGPT.ai benchmark excerpt defines answer consistency as 1/0 metric with Retrieve Context, Generate, Evaluate steps

The metric matters because it delivers an unambiguous measurement with no grey areas. It is a highly effective approach for identifying hallucinations, and it provides a level playing field for comparing CustomGPT.ai against OpenAI Assistant API V2.
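The pass/fail scoring described in this section can be sketched as follows. This is an illustrative reading of the metric, not the Tonic Validate API and not the study’s code: the judge is a hypothetical callable (the study used GPT-4o as the assessor), and toy_judge is a deliberately crude stand-in so the example runs without a model.

```python
from typing import Callable, List, Tuple

# A judge takes (answer, context) and returns True only when the whole answer
# is consistent with the context. Hypothetical interface, assumed for this sketch.
Judge = Callable[[str, str], bool]


def answer_consistency_binary(answer: str, context: str, judge: Judge) -> int:
    """Return 1 if the answer is entirely consistent with the context, else 0."""
    return 1 if judge(answer, context) else 0


def score_run(samples: List[Tuple[str, str]], judge: Judge) -> Tuple[int, int]:
    """Score (answer, context) pairs and return (consistent, inconsistent) counts."""
    scores = [answer_consistency_binary(a, c, judge) for a, c in samples]
    passed = sum(scores)
    return passed, len(scores) - passed


def toy_judge(answer: str, context: str) -> bool:
    """Toy stand-in: pass only if every sentence of the answer appears verbatim in the context."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return bool(sentences) and all(s in context for s in sentences)
```

Because a single unsupported statement yields 0, there is no partial credit: an answer that is mostly grounded scores the same as one that is entirely fabricated.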

Analyzing Responses

The following is an example question from the benchmark, with the responses from CustomGPT.ai and OpenAI:

Question: “What is the significance of the Turing Test in the field of artificial intelligence?”

CustomGPT.ai Answer (Score: 1): “The Turing Test, proposed by Alan Turing in 1950, is a significant benchmark in artificial intelligence to determine whether a machine can exhibit intelligent behavior indistinguishable from that of a human. It has influenced the development and evaluation of AI systems, highlighting the importance of natural language processing and human-computer interaction.”

OpenAI Answer (Score: 0): “The Turing Test is used to see if a machine can act like a human. It’s important in AI.”

And the study’s analysis of this response:

Analysis: CustomGPT.ai offers a detailed explanation of the Turing Test’s significance, including its historical context and impact on AI development. OpenAI’s answer, while not incorrect, lacks depth and fails to capture the test’s full importance in the field.

Study Findings and Results

The quantitative findings from all 945 questions were as follows:

Inconsistent Responses (Binary Score 0):

OpenAI: 513 instances

CustomGPT.ai: 457 instances

Study interpretation: OpenAI had a higher number of inconsistent responses, indicating more frequent hallucinations.

Consistent Responses (Binary Score 1):

OpenAI: 432 instances

CustomGPT.ai: 488 instances

Study interpretation: CustomGPT.ai had a higher number of consistent responses, showcasing its superior ability to maintain context accuracy.

CustomGPT.ai leads OpenAI in Answer Consistency Binary Metric, with more score-1 counts and fewer score-0 results.

The study drew the following insights:

Accuracy and Consistency: CustomGPT.ai achieved a 13% higher Accuracy Rate compared to OpenAI, providing consistent answers with no extraneous information the majority of the time.

Response Time: CustomGPT.ai demonstrated a 34% faster average response time, indicating improved efficiency without sacrificing accuracy.

Hallucination Reduction: The 10% lower hallucination rate suggests that CustomGPT.ai’s advanced algorithms more effectively filter out irrelevant information, reducing the likelihood of generating unfounded content.
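The headline percentages can be checked against the raw counts reported above. The short calculation below assumes the figures are relative differences between the two systems, which is an interpretation rather than something the study states outright. Under that assumption the counts give pass rates of roughly 51.6% versus 45.7%, a relative gain in consistent answers of about 13%, and a relative drop in inconsistent answers of about 10.9% (reported as 10%). No raw timing data is published in this article, so the response-time figure is not reproduced here.

```python
TOTAL_QUESTIONS = 945

# Counts as reported in the study findings.
consistent = {"CustomGPT.ai": 488, "OpenAI": 432}
inconsistent = {"CustomGPT.ai": 457, "OpenAI": 513}

# Sanity check: every question is scored either 1 or 0.
for system in consistent:
    assert consistent[system] + inconsistent[system] == TOTAL_QUESTIONS


def rate(count: int, total: int = TOTAL_QUESTIONS) -> float:
    """Share of all questions that fall into a given bucket."""
    return count / total


def relative_change(new: float, baseline: float) -> float:
    """Change of `new` relative to `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


pass_rate_customgpt = rate(consistent["CustomGPT.ai"])
pass_rate_openai = rate(consistent["OpenAI"])

# Assumed reading of "13% higher accuracy": relative gain in consistent answers.
accuracy_gain = relative_change(consistent["CustomGPT.ai"], consistent["OpenAI"])

# Assumed reading of "10% lower hallucination rate": relative drop in inconsistent answers.
hallucination_change = relative_change(inconsistent["CustomGPT.ai"], inconsistent["OpenAI"])

print(f"Pass rate, CustomGPT.ai: {pass_rate_customgpt:.1%}")
print(f"Pass rate, OpenAI:       {pass_rate_openai:.1%}")
print(f"Relative accuracy gain:  {accuracy_gain:.1%}")
print(f"Relative change in inconsistent answers: {hallucination_change:.1%}")
```

Note that the absolute gap between the two pass rates is about six percentage points; the 13% figure only follows when the difference is expressed relative to OpenAI’s count.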

The study also drew a number of technical insights, including that the lower hallucination rate implies CustomGPT.ai may have “enhanced capabilities in distinguishing between relevant and irrelevant information, possibly through advanced semantic understanding or improved knowledge base integration.” Also, the performance gap maintained in the large sample size of 945 questions “suggests that CustomGPT.ai’s improvements are likely to hold at scale.”

High-Precision and Contextual Integrity

The results of this comprehensive benchmark study position CustomGPT.ai as an AI solution for chatbots, AI agents, and AI assistants in industries where precision is vital.

“The consistent performance across various questions and contexts demonstrates its robustness and adaptability, which is crucial for deploying AI in dynamic environments where context can vary significantly.”

The Atman Academy’s research team is available for a detailed breakdown of its methodology.

CustomGPT.ai provides a business-grade, privacy-first, zero-code generative AI platform. SaaS technology makes it quick, easy, and affordable for anyone—regardless of technical expertise—to provide their own content and data to build custom AI chatbots and other GPT agents with chatbot calls to action and confidently deploy these solutions. We leverage advanced large language models (including OpenAI’s GPT-5 and GPT-4) to offer industry-leading accuracy and anti-hallucination protection.

Frequently Asked Questions

What did the latest CustomGPT.ai vs OpenAI benchmark report on accuracy?

The reported result is that CustomGPT.ai outperformed OpenAI Assistant API V2 on three outcomes: answer accuracy, hallucination rate (fewer hallucinations), and average response speed (faster responses).

How large and diverse was the evaluation dataset?

The comparison used 945 questions (also described as nearly 1,000) drawn from nine different datasets. That indicates a broader test scope than a small demo-style evaluation.

Who conducted and validated the benchmark?

The benchmark is described as performed by Atman Academy and validated by Tonic.ai.

Was this a one-time result or consistent with earlier tests?

The source says this was a new comparison and also references a prior third-party assessment from a few months earlier where CustomGPT.ai reportedly outperformed OpenAI, Google, Amazon, and Cohere.

Why is hallucination resistance emphasized in these benchmark results?

The source frames hallucination control as a core requirement for organizations adopting AI, stating that limiting or eliminating ungrounded answers should be a priority.

Does the comparison context include alternatives beyond OpenAI?

Yes. While the newest benchmark discussed here is against OpenAI Assistant API V2, the source also cites an earlier third-party assessment that included Google, Amazon, and Cohere.

Related Resources

These pieces add useful context on why grounded AI systems like CustomGPT.ai deliver more reliable answers.

RAG Vs. AI Hallucinations — Explores how retrieval-augmented generation reduces hallucinations and improves answer quality in production AI systems.
How CustomGPT.ai Works — Walks through the platform’s approach to turning your content into accurate, source-grounded AI responses.
Pharma RAG Solutions — Shows how pharmaceutical teams use CustomGPT.ai to deliver compliant, trustworthy answers from complex internal knowledge.
