LLM Hallucination Detection Leaderboard
Evaluating the factual accuracy and faithfulness of LLMs in both RAG and non-RAG settings with Verify by kluster.ai, which provides an API for detecting hallucinations in any model's outputs.

The LLM Hallucination Detection Leaderboard is a public, continuously updated comparison of how well popular Large Language Models (LLMs) avoid hallucinations: responses that are factually incorrect, fabricated, or unsupported by evidence. By surfacing transparent metrics across tasks, we help practitioners choose models they can trust in production.

Why does hallucination detection matter?

  • User Trust & Safety: Hallucinations undermine confidence and can damage reputation.
  • Retrieval-Augmented Generation (RAG) Quality: In enterprise workflows, LLMs must remain faithful to supplied context. Measuring hallucinations highlights which models respect that constraint.
  • Regulatory & Compliance Pressure: Upcoming AI regulations require demonstrable accuracy standards. Reliable hallucination metrics can help you meet these requirements.

How we measure hallucinations

We evaluate each model on two complementary benchmarks and compute a hallucination rate (lower = better):

  1. HaluEval-QA (RAG setting): Given a question and a supporting document, the model must answer only using the provided context.
  2. UltraChat Filtered (Non-RAG setting): Open-domain questions with no extra context test the model's internal knowledge.

Outputs are automatically verified by Verify from kluster.ai, which cross-checks claims against the source document or web results.

Note: Full experiment details, including prompt templates, dataset description, and evaluation methodology, are provided at the end of this page for reference.


Stay informed as we add new models and tasks, and follow us on X or join our Discord for the latest updates on trustworthy LLMs.

HaluEval-QA benchmark (RAG): The model receives a question plus supporting context. We report the % of answers that introduce facts not found in that context (lower is better). See the Methodology section below for more information.

UltraChat benchmark (~11k prompts, non-RAG): Evaluates open-domain answers when only the question is given. Score is the % of hallucinated responses (lower is better). See the Methodology section below for more information.

Rank | Model | Average Hallucination Rate (%) | RAG Hallucination Rate (%) | Non-RAG Hallucination Rate (%)
---- | ----- | ------------------------------ | ------------------------- | -----------------------------
1 🥇 | gemini-2.5-pro | 0.97 | 1.57 | 0.36
2 🥈 | claude-sonnet-4 | 1.40 | 2.21 | 0.60
3 🥉 | DeepSeek-R1-0528 | 1.52 | 2.26 | 0.78
4 | klusterai-Meta-Llama-3.3-70B-Instruct-Turbo | 1.60 | 2.12 | 1.09
5 | Llama-4-Maverick-17B-128E-Instruct-FP8 | 2.01 | 3.34 | 0.69
6 | gemma-3-27b-it | 2.09 | 3.71 | 0.48
7 | DeepSeek-V3-0324 | 2.78 | 4.66 | 0.91
8 | Qwen3-235B-A22B-FP8 | 2.96 | 5.04 | 0.88
9 | gpt-4o | 3.34 | 6.05 | 0.64
10 | Llama-4-Scout-17B-16E-Instruct | 3.36 | 4.23 | 2.48
11 | Mistral-Small-24B-Instruct-2501 | 6.30 | 4.74 | 7.85
12 | Qwen2.5-VL-7B-Instruct | 6.95 | 9.35 | 4.55
13 | Mistral-Nemo-Instruct-2407 | 9.68 | 10.63 | 8.74
14 | klusterai-Meta-Llama-3.1-8B-Instruct-Turbo | 10.30 | 8.10 | 12.50
15 | Magistral-Small-2506 | 18.34 | 8.62 | 28.07

Get started with Verify by kluster.ai

Verify is an intelligent agent that validates LLM outputs in real time.

Quick API examples
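The snippet below is a minimal sketch of how a hallucination check with Verify could be issued from Python over HTTP. The endpoint path, request fields, and response shape are illustrative assumptions rather than the official interface; consult the kluster.ai documentation for the exact API.

# Hypothetical sketch of a hallucination check with Verify.
# The URL, payload fields, and response keys are assumptions for illustration;
# see the kluster.ai docs for the real endpoint and schema.
import os
import requests

API_URL = "https://api.kluster.ai/v1/verify"   # placeholder endpoint
API_KEY = os.environ["KLUSTER_API_KEY"]        # assumed bearer-token auth

payload = {
    "prompt": "Which tennis player won more Grand Slam titles, Henri Leconte or Jonathan Stark?",
    "output": "Jonathan Stark won two Grand Slam titles, while Henri Leconte won four.",
    # Include the source document for RAG-style checks; omit it to let Verify
    # fact-check the answer against web results instead.
    "context": "Jonathan Stark (born April 3, 1971) is a former professional tennis player ...",
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # expected: a hallucination verdict plus a short explanation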

Tasks

We evaluate each model using two benchmarks:

Retrieval-Augmented Generation (RAG setting)

RAG evaluates how well a model stays faithful to a provided context when answering a question. The input consists of a synthetic or real context paired with a relevant question. Models are expected to generate answers using only the information given, without adding external knowledge or contradicting the context.

  • Source: HaluEval QA
  • Dataset Size: 10,000 question-context pairs
  • Prompt Format: Prompt with relevant context document
  • Temperature: 0 (to enforce deterministic, grounded outputs)
  • System Prompt: Instructs the model to only use the document and avoid guessing.

System prompt

This is the system prompt used to generate LLM outputs in the RAG setting:

You are an assistant for question-answering tasks.  
Given the QUESTION and DOCUMENT you must answer the QUESTION using the information in the DOCUMENT.  
You must not offer new information beyond the context provided in the DOCUMENT. Do not add any external knowledge.  
The ANSWER also must not contradict information provided in the DOCUMENT.  
If the DOCUMENT does not contain the facts to answer the QUESTION or you do not know the answer, you truthfully say that you do not know.  
You have access to information provided by the user as DOCUMENT to answer the QUESTION, and nothing else.  
Use three sentences maximum and keep the answer concise.

Prompt format

Each prompt is formatted as follows:

DOCUMENT:
{context}

QUESTION:
{question}

Message structure

The models use the following message structure:

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]

Real-World Knowledge (Non-RAG setting)

This setting evaluates how factually accurate a model is when no context is provided. The model must rely solely on its internal knowledge to answer a broad range of user questions across many topics. The answers are then verified using web search to determine factual correctness.

  • Source: Filtered from UltraChat prompts
  • Dataset Size: 11,746 single-turn user queries
  • Prompt Format: Single user prompt without additional context
  • Temperature: 1 (to reflect natural, fluent generation)
  • System Prompt: Encourages helpfulness, accuracy, and honesty when unsure.

System prompt

This is the system prompt used to generate LLM outputs in the Non-RAG setting:

You are a helpful, factual, and concise assistant. 
Always try to answer the user's question clearly and completely. 
Do not make up information. If you are unsure or lack the knowledge, say so.

Message structure

The message structure for the Non-RAG setting is the same as in the RAG setting.
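For completeness, here is a sketch of the Non-RAG call, reusing the placeholder client from the RAG sketch above; the only differences are that the user message is the raw question and the temperature is 1.

# Non-RAG generation sketch (reuses the placeholder `client` from the RAG example).
question = "What are the ingredients that made the most expensive donut so expensive?"
system_prompt = "You are a helpful, factual, and concise assistant. ..."  # Non-RAG system prompt from above

completion = client.chat.completions.create(
    model="klusterai/Meta-Llama-3.3-70B-Instruct-Turbo",  # example model identifier
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},  # raw question, no DOCUMENT block
    ],
    temperature=1,  # natural, fluent generation for the Non-RAG setting
)
answer = completion.choices[0].message.content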


Evaluation Method

We use Verify, a hallucination detection tool built by kluster.ai, to classify model outputs:

  • In the RAG setting, Verify checks if the output contradicts, fabricates, or strays from the input document.
  • In the real-world knowledge setting, Verify uses search queries to fact-check the answer based on current, public information.

Each model's hallucination rate is computed as:

Hallucination Rate = (Number of hallucinated outputs) / (Total number of prompts)

A lower hallucination rate indicates better performance.
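As a toy illustration of the formula, the snippet below aggregates hypothetical per-output Verify verdicts into a hallucination rate.

# Toy illustration of the hallucination-rate formula with hypothetical verdicts.
verdicts = [False, True, False, False]  # True = Verify flagged the output as a hallucination

hallucination_rate = 100 * sum(verdicts) / len(verdicts)
print(f"Hallucination rate: {hallucination_rate:.2f}%")  # -> 25.00%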


Examples of Hallucination Detection

RAG Example

Context

Jonathan Stark (born April 3, 1971) is a former professional tennis player from the United States. 
During his career he won two Grand Slam doubles titles (the 1994 French Open Men's Doubles and 
the 1995 Wimbledon Championships Mixed Doubles). He reached the men's singles final at the French Open in 1988, 
won the French Open men's doubles title in 1984, and helped France win the Davis Cup in 1991.

Prompt

Which tennis player won more Grand Slam titles, Henri Leconte or Jonathan Stark?

LLM Answer

Jonathan Stark won two Grand Slam titles, while Henri Leconte won four. 
Therefore, Henri Leconte won more Grand Slam titles.

Verify Label

Hallucination: Yes

Verify Response

The answer introduces new information about Henri Leconte's Grand Slam titles, 
which is not present in the document.

Non-RAG Example

Prompt

What are the ingredients that made the most expensive donut so expensive?

LLM Answer

Okay, the user is asking about the ingredients that make the most expensive donut so expensive. Let me start by recalling which donut holds that title. I remember reading about the Golden Cristal Ube Donut from Manila Social Club. That was priced at around $1,200. But why so high? The main factor is the edible gold, specifically 24-karat gold flakes. Gold is expensive per gram, and covering a whole donut would require a significant amount.
Next, the filling. They used champagne jelly infused with Dom Perignon Rose. Dom Perignon is a luxury champagne brand, especially the Rose variant. Using such a high-end alcohol adds a lot to the cost. Then the ingredients themselves: ube, which is a high-quality purple yam, and strawberries dusted in gold. Sourcing premium produce also contributes, but it's minor compared to the gold and champagne.
I should also consider the exclusivity and presentation. Handcrafting each donut by a chef adds labor costs. The packaging in a luxury box makes it feel more exclusive. Plus, the limited availability and publicity stunts like celebrity involvement drive up the price. The user might be curious if it's actually worth the cost or just a marketing ploy. They might want to understand if the ingredients justify the price or if it's mostly branding.
The user's deeper interest could be in luxury food trends or the psychology behind pricing. Maybe they're researching for a project or just intrigued by extravagant items. It's important to break down tangible vs. intangible factors. Highlighting that the cost isn't just about taste but the experience and status symbol. Emphasize that while the ingredients are premium, the branding plays a huge role. Make sure to clarify that the price is more about exclusivity than the actual food value.
The most expensive donut ever sold was The Golden Cristal Ube Donut created by Brooklyn's Manila Social Club (reportedly priced around $1,200). Its astronomical cost wasn't due to one ingredient, but a combination of ultra-luxury ingredients, exclusivity, and presentation. Here's the breakdown:
24-Karat Edible Gold Flakes: This is the most visually striking and expensive component. The entire donut is encased in a thick layer of pure, edible 24-karat gold flakes. Gold itself is expensive per gram, and covering an entire donut requires a significant amount.
Dom Perignon Rose Champagne Jelly Filling: Instead of a simple custard or jam, the filling was a sophisticated jelly infused with Dom Perignon Rose Vintage Champagne. Dom Perignon is one of the world's most prestigious and expensive champagnes. Using it as a primary flavoring agent adds immense cost.
Premium Base Ingredients:
Ube (Purple Yam): While ube itself isn't wildly expensive, sourcing high-quality, flavorful ube puree for the donut dough contributes to its gourmet status and unique flavor profile.
Handcrafted Dough: Presumably made with premium flour, butter, eggs, etc.
Gold-Dusted Strawberries: Garnishing the top were fresh strawberries meticulously dusted with more edible gold flakes. This adds another layer of gold cost and visual extravagance.
Exclusivity & Labor: Each donut was handcrafted by a skilled pastry chef using meticulous techniques. They weren't mass-produced; they were made to order in very limited quantities.
Luxurious Presentation: It was reportedly presented in an extravagant, custom-designed luxury box befitting its price tag.
Why It's Really So Expensive:
While the ingredients (especially the gold and Dom Perignon) are incredibly costly, the $1,200 price tag goes far beyond simple ingredient cost:
Novelty & Status Symbol: It was designed to be a unique, headline-grabbing luxury item. Buying one was as much about owning something exclusive and extravagant as it was about eating a donut.
Marketing & Publicity: The extreme price was the marketing. It generated massive global media attention for Manila Social Club.
Experience: Consuming something covered in pure gold and infused with ultra-premium champagne is an experience in itself, not just a snack.
Celebrity/High-End Clientele: Reportedly commissioned for celebrities and wealthy individuals, allowing for such a premium price point.
In essence: The extreme cost came from the lavish use of edible gold and Dom Perignon champagne, amplified significantly by the exclusivity, craftsmanship, presentation, and status associated with consuming such an extravagant item. It was more a luxury art piece and status symbol than a traditional breakfast pastry.

Verify Label

Hallucination: Yes

Verify Response

The response claims that 'The Golden Cristal Ube Donut' created by Manila Social Club was priced around $1,200. 
However, multiple sources indicate that the donut was actually priced at $100, not $1,200. 
The response lists several ingredients and factors contributing to the donut's cost, including 24-karat edible gold flakes, 
Dom Perignon Rose Champagne jelly filling, premium base ingredients, gold-dusted strawberries, exclusivity, labor, 
and luxurious presentation. While these details are mostly consistent with the sources provided, 
the price discrepancy raises concerns about the accuracy of other details. Upon closer inspection, 
some details like the use of 'Dom Perignon Rose Vintage Champagne' are not directly supported by the provided sources, 
which mention 'Cristal Champagne' rather than 'Dom Perignon'. The sources provided confirm the use of 
24-karat gold flakes, ube, and Cristal Champagne jelly filling, which aligns with some of the information given in the response. 
The response includes additional details such as 'gold-dusted strawberries' and 'Dom Perignon Rose Champagne Jelly Filling' 
that are not directly corroborated by the sources. The sources do mention Cristal Champagne, not Dom Perignon. 
The overall price and some specific ingredients/details appear to be exaggerated or incorrect in the response.