8. Reducing Hallucinations

The biggest headache for users of Large Language Models is the risk of hallucinations. When you’re writing a poem, you need not worry about them, but in almost all other contexts, you want the LLM’s answers to be factually correct.

Eliminating hallucinations is one of the holy grails of language modeling. So far, no technique has been shown to bring the risk of incorrect claims to zero. There are some methods, however, that we can use to reduce their number.

Parametric Knowledge vs External Sources #

Tip: Rely on tools rather than the model’s parametric knowledge.

LLMs are most prone to hallucinating when they rely on their parametric memory: the calculations the model has learned through its training process. These calculations were optimized to predict the next word in a sequence; whether the predicted words together express a fact was never part of the equation. You could argue that when they do, this is more a byproduct of the training process than its explicit goal: it merely follows from the fact that the training data contains more correct than incorrect information. If claims that the earth is flat outnumbered claims that it is round, an LLM trained on that data would happily become a flat-earther.

File Sources #

One way to reduce the frequency of hallucinations is to give the model a reliable source to consult. Many chatbots, like ChatGPT and Claude, allow users to upload a file and ask questions about it. Some can even be connected to a cloud service like Google Drive or OneDrive and read all the files users have stored there. Providing the model with one or more such reliable sources reduces the number of hallucinations, but does not bring it down to zero. It is still essential to check all important information.

The World Wide Web #

For a lot of tasks, the ultimate source of information is the world wide web. In fact, whenever ChatGPT considers its internal memory insufficient for a prompt, it decides to consult the web: it builds a search query from your prompt, submits it to a search engine, reads the most relevant results, and then compiles its answer from those results. This again increases the chances that the answer is correct. If ChatGPT’s internal memory does not allow it to predict, say, someone’s date of birth with a sufficiently high degree of confidence, it will often search the web for a source, like Wikipedia, that has this information and take the answer from there. There are of course instances where it misjudges its own knowledge, proceeds without any external help and produces incorrect information. In those cases you can explicitly prompt it to ‘use the world wide web’.
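
Conceptually, this fallback looks something like the sketch below. Both helper functions are placeholders standing in for a real search backend and a real language model call; they are not part of any specific API.

```python
# Conceptual sketch of the web-search fallback described above.
# build_search_query and web_search are placeholders: a real chatbot would have
# the LLM rewrite the prompt as a query and would call an actual search engine.

def build_search_query(prompt: str) -> str:
    """Placeholder: turn the user's prompt into a short search query."""
    return prompt

def web_search(query: str) -> list[str]:
    """Placeholder: return text snippets from the most relevant search results."""
    return [f"(snippet of a web page relevant to: {query})"]

def answer_with_web_search(prompt: str) -> str:
    query = build_search_query(prompt)   # 1. build a search query from the prompt
    snippets = web_search(query)         # 2. submit it and read the top results
    context = "\n".join(snippets)
    # 3. compile the answer from the retrieved sources instead of parametric memory alone
    return f"Answer to '{prompt}', grounded in:\n{context}"

print(answer_with_web_search("When was Céline Dion born?"))
```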

The problem is that many of today’s models are not linked to a search engine. Claude, for example, and many of the younger models like DeepSeek, do not have this capability and rely solely on their internal memory. The same goes for the open-source models you can download to your own computer, unless you connect them to the necessary tooling yourself. As a result, their answers will contain far more hallucinations than those of models capable of web search. If you’re doing research of any kind, it is therefore advisable to use a model that has access to the web, like ChatGPT or Copilot.

Programming Code #

We’ve seen before that large language models struggle with mathematics and don’t actually perform the calculations we ask them to do. Luckily, ChatGPT has a solution for this as well. Whenever we give it a complex calculation that it is unlikely to answer correctly with its word-prediction capabilities, it switches to another strategy: rather than producing natural language, it generates the code for your calculation in a programming language called Python. Next, it runs this code and builds a response on the basis of the result. As this happens, you’ll see a short message that ChatGPT is ‘Analyzing’ your instruction. Behind the answer, there will also be a small icon that you can click to see the code. As with the internet search above, this strategy greatly increases the probability that the result is correct. The only remaining risk is that ChatGPT generates the wrong code, and for straightforward mathematical instructions that risk is very small.
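
To get a feel for what happens behind the scenes, here is the kind of Python snippet a model might generate and run for a prompt like ‘How much is 10,000 euros worth after 15 years at 3% interest?’. The exact code varies from run to run; this is only an illustrative sketch.

```python
# Sketch of the Python code a model might generate for the prompt
# "How much is 10,000 euros worth after 15 years at 3% interest?"
# The arithmetic is carried out by the Python interpreter, not by word prediction.

principal = 10_000   # starting amount in euros
rate = 0.03          # yearly interest rate
years = 15

final_amount = principal * (1 + rate) ** years
print(f"After {years} years: {final_amount:.2f} euros")
```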

Consistency Prompting #

Tip: Take the most frequent answer from multiple responses.

We’ve seen before that LLMs are not deterministic: when we repeat our question, they (almost) always give a different answer. Advanced users can influence this degree of variation with parameters like temperature or top-k, which are available in tools like OpenAI’s Dashboard. In more creative tasks, higher variation is often desirable, while in others we actually want lower variation.

Self-consistency (Wang et al. 2022) is based on the intuition that tasks that require deliberate thinking typically have multiple reasoning paths to the correct answer. First, it uses chain-of-thought prompting to have the model generate intermediate reasoning steps before it gives the answer. Second (and in contrast to traditional chain-of-thought prompting), it has the language model generate many different responses. Finally, it selects the most frequent answer from these responses.
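
A minimal sketch of self-consistency in code, assuming access to the OpenAI Python client, could look as follows. The model name and the naive way of extracting the final answer (taking the last line of each response) are illustrative assumptions, not part of the original method.

```python
from collections import Counter

from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()

PROMPT = (
    "A train travels 60 km in 45 minutes. What is its average speed in km/h? "
    "Think step by step, then give the final answer on the last line as 'Answer: <number>'."
)

def sample_answers(n: int = 5, temperature: float = 0.8) -> list[str]:
    """Sample n chain-of-thought responses and keep only the final answer line of each."""
    answers = []
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumed model name; use whichever model you have access to
            messages=[{"role": "user", "content": PROMPT}],
            temperature=temperature,  # some variation is needed to get different reasoning paths
        )
        text = response.choices[0].message.content.strip()
        answers.append(text.splitlines()[-1])  # naive extraction: last line holds "Answer: ..."
    return answers

# Self-consistency: take the most frequent answer across the sampled responses.
answers = sample_answers()
best_answer, count = Counter(answers).most_common(1)[0]
print(f"{best_answer}  (chosen by {count} of {len(answers)} samples)")
```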

Self-consistency selects the most frequent answer from multiple chain-of-thought responses.
(Source: Wang et al. 2022)

Wang et al.’s experiments on a range of datasets show that with as few as five reasoning paths, the accuracy of the most frequent answer significantly exceeds that of the single-answer chain-of-thought paradigm. The more reasoning paths are sampled, the further the accuracy improves.

Self-consistency leads to more accurate answers than traditional chain-of-thought prompting.
(Source: Wang et al. 2022)

Verification Prompting #

Tip: Have the LLM fact-check its own response.

Another way of reducing hallucinations is to make use of automatic verification strategies. One such popular strategy is chain-of-verification prompting (Dhuliawala et al. 2023), which takes advantage of the fact that the “knowledge” of a model may vary with the prompt. For example, although a model may know who Céline Dion’s mother is, it might not know who Thérèse Tanguay’s daughter is.

Claude Sonnet 4 knows Céline Dion’s mother…

… but does not know her mother’s daughter.

Chain-of-verification (CoVe) prompts the model to fact-check its own response. This happens in four steps:

  1. The LLM gives an initial response,
  2. It plans verification questions to fact-check its initial response,
  3. It answers those questions one by one, and
  4. Generates a final, verified response on the basis of those answers.

Let’s take a look at the example from the original chain-of-verification paper to make this clearer. Assume we ask an LLM to name politicians who were born in New York. The initial response might mention Donald Trump (who was indeed born in New York) alongside Hillary Clinton and Michael Bloomberg (who were born elsewhere). To weed out the incorrect answers, the model next plans verification questions: for each politician on the list, it asks where they were born. As it answers these questions one by one, it becomes clear that not all of the politicians meet the requirement. This information is then used to compile a final list of correct answers.
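
Under the same assumptions as the self-consistency sketch above (the OpenAI Python client and an assumed model name), a multi-prompt version of these four steps could look roughly like this; the helper prompts are simplified and are not the exact prompts from the paper.

```python
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed model name

def ask(prompt: str) -> str:
    """Send a single prompt to the model and return its text response."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

question = "Name some politicians who were born in New York."

# Step 1: initial response.
baseline = ask(question)

# Step 2: plan verification questions for the claims in the initial response.
plan = ask(
    "Here is a draft answer to the question '" + question + "':\n\n" + baseline +
    "\n\nWrite one verification question per person, asking where they were born. "
    "Return only the questions, one per line."
)

# Step 3: answer each verification question independently.
verifications = [q + "\n" + ask(q) for q in plan.splitlines() if q.strip()]

# Step 4: generate a final, verified response on the basis of those answers.
final = ask(
    "Original question: " + question + "\n\nDraft answer:\n" + baseline +
    "\n\nVerification questions and answers:\n" + "\n\n".join(verifications) +
    "\n\nUsing only the people whose birthplace was verified as New York, write the final answer."
)
print(final)
```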

Chain-of-verification prompts the model to fact-check its own initial response.
(Source: Dhuliawala et al. 2023)

You can split up this chain of verification into multiple prompts, or instruct the model to perform all steps in one prompt. A CoVe-inspired prompt for a list of promising writers younger than 40, for example, would instruct the model to compile an initial list first, then check each author’s date of birth, and conclude with a filtered list. When I presented Claude Sonnet 4 with this prompt, it correctly identified Tommy Orange and Hanya Yanagihara as mistakes in its initial list and came back with an error-free final list.

Unclear prompt

Give a list of promising writers younger than 40.

Clear prompt

  1. Give a list of promising writers younger than 40.
  2. For each of the writers, check their date of birth.
  3. Compile a final list of promising writers younger than 40.