Choosing a Model #
There is no shortage of Large Language Models these days. ChatGPT, Claude, Gemini, Grok, Qwen, DeepSeek, Mistral and many others offer high-quality chatbots for many applications. It’s hard to give strong recommendations: the quality of a model often depends on the task and can also be a matter of personal preference. New models are released all the time, so today’s winner for a particular use case may be beaten tomorrow by a new kid on the block. Keep an eye on leaderboards like the LMArena if you want to find out what model currently scores best on tasks like coding, math or creative writing. So, rather than claiming ‘Gemini is best for creative writing’ or ‘Claude excels at code’, in this chapter I want to focus on a number of objective criteria that can help you make an informed decision when comparing LLMs.
Bigger is Better #
One of the most defining characteristics of an LLM is its size. This size is usually expressed as the number of parameters: the weights in its calculations that are tuned during training. Modern LLMs have billions of parameters. The smallest ones may have 2 or 3 billion, while the largest ones have hundreds of billions. The more parameters, the more “brainpower” a model has and the more knowledge it can store. This means that larger models will perform better than smaller ones on most tasks: they will give more accurate answers and hallucinate less often. Size also comes with a downside, however: because larger models need more computing power, they are slower and need more energy to train and run.
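To get a feel for what those parameter counts mean in practice, here is a rough back-of-envelope sketch in Python. It only counts the memory needed to store the weights (two bytes per parameter at 16-bit precision, half a byte when quantized to 4 bits) and ignores everything else a running model needs, so treat the numbers as ballpark figures.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to store a model's weights, in gigabytes."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(3, 2))     # a 3B model in 16-bit precision: ~6 GB
print(weight_memory_gb(70, 2))    # a 70B model in 16-bit precision: ~140 GB
print(weight_memory_gb(70, 0.5))  # the same 70B model quantized to 4 bits: ~35 GB
```

This is also why models of a few billion parameters can run on an ordinary laptop, while the largest ones need data-center hardware.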
Many LLM providers offer both smaller and larger versions of their models. In ChatGPT, the smaller models all have ‘mini’ in their name. As the comparison below between GPT-4o and 4o-mini shows, the larger alternative is always smarter but slower, and (often many times) more expensive. When responses need to be factual and correct, the larger model usually presents the better choice. For simpler, less critical tasks, don’t hesitate to go for the more budget-friendly option.

The Rise of Reasoning #
Fairly soon after the breakthrough of Large Language Models, developers hit a roadblock. Their models continued to perform better when they saw more training data, but there was hardly any additional training data left to feed them. Inspired by chain-of-thought prompting, which we’ll discuss in the chapter on advanced prompt engineering, they started training their models to generate a so-called reasoning trace before outputting the final answer. This reasoning trace allows models to break down complex tasks into smaller steps, revise their solution and explore alternative paths. Experiments showed that model accuracy on challenging tasks like math and programming improved dramatically with the length of this reasoning trace — or, more informally, the time the models spend “thinking.”

The performance of OpenAI’s o1 model improves not only with longer training (left) but also with time spent ‘thinking’ (right).
(Source: OpenAI)
How do these models learn to “think”? Their developers typically use several carefully crafted training datasets. The first of these is a “cold start” set of manually curated prompts and responses with a reasoning trace. They are used as training examples in the standard second training phase for LLMs. Because this dataset is far too small to lead to robust performance, the model continues to train on a second dataset with reasoning responses it has generated itself. Of course, at this early stage many of its responses are unreliable: before they are used as training data, their final answer is checked (this is easy to do in math and programming tasks) and all examples with an incorrect answer are ignored. Through this combination of handcrafted and automatically generated training data, LLMs can learn to reason in a consistent manner.
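The filtering step at the heart of that second dataset is simple enough to sketch in a few lines of Python. The snippet below is a toy illustration, not any lab’s actual pipeline: `toy_generate` stands in for a real model and the “verification” is just a string comparison, but the logic is the same — sample several reasoning traces per problem, keep the ones whose final answer checks out, and discard the rest.

```python
import random

def extract_final_answer(trace: str):
    """Pull the text after 'Answer:' out of a generated trace, if present."""
    if "Answer:" in trace:
        return trace.rsplit("Answer:", 1)[1].strip()
    return None

def build_self_training_set(generate, problems, samples_per_problem=8):
    """Keep only (prompt, trace) pairs whose final answer can be verified."""
    kept = []
    for prompt, check in problems:
        for _ in range(samples_per_problem):
            trace = generate(prompt)                  # reasoning steps + final answer, as text
            answer = extract_final_answer(trace)
            if answer is not None and check(answer):  # verification is easy for math or code
                kept.append((prompt, trace))
    return kept

# Toy stand-in for a model that is sometimes right and sometimes wrong.
def toy_generate(prompt):
    guess = random.choice(["4", "5"])
    return f"Let me work through this step by step... Answer: {guess}"

problems = [("What is 2 + 2?", lambda answer: answer == "4")]
dataset = build_self_training_set(toy_generate, problems)
print(f"kept {len(dataset)} verified reasoning traces out of 8 samples")
```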
One of the most well-known reasoning models is DeepSeek’s R1 model, which caused quite a stir when it was first released. Not only did it show that China had all but caught up with the US in artificial intelligence; DeepSeek’s models had also been far cheaper to train than many of their competitors. To see a concrete example of a reasoning trace, let’s ask R1 to write a haiku on artificial intelligence. As the video below shows, it first recalls what a haiku is and then explores some potential themes (like learning, data or algorithms). After that, it generates a draft line by line, which it checks for content as well as syllable count. Only when that is all done does it return the final haiku.
Reasoning does not necessarily help with all types of problems. OpenAI’s experiments with their o1 model indicate that users tend to prefer o1 to GPT-4o for challenging tasks like computer programming, data analysis and mathematical calculation, but not for more language-oriented tasks such as personal writing and editing text. This may be a result of the training process, which is focused on verifiable tasks such as math problems, but it may also mean that tasks of a more textual nature require a different approach.

Users tend to prefer o1 to GPT-4o for challenging tasks like math and programming, but not for more language-oriented tasks.
(Source: OpenAI)
You may have noticed my frequent use of quotation marks in this section. This is because it’s important to stress that the reasoning traces of models like DeepSeek-R1 or OpenAI’s o3 are also just text. So when ChatGPT informs us that its model thought for 5 seconds, it was writing rather than thinking. Indeed, researchers from Arizona State University urge everyone to Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces, arguing that casting this process as thinking “is actively harmful, because it engenders false trust and capability in these systems, and prevents researchers from understanding or improving how they actually work” (p.9).
Indeed, the team at Anthropic reported that Reasoning Models Don’t Always Say What They Think (Chen et al. 2025). To test the faithfulness of reasoning traces, they presented Claude Sonnet and DeepSeek-R1 with a set of multiple choice questions, both with and without a hint included in the input. These hints could take various forms: sometimes the correct answer could be inferred from a visual cue (e.g., a tick mark); other times the prompt included an answer suggestion (e.g. A Stanford professor indicates the answer is (A)) or the correct answer had appeared earlier in the conversation. In all cases, both Claude and DeepSeek-R1 showed a tendency to follow this hint — yet their verbalized reasoning processes rarely acknowledged doing so. While reasoning models were considerably more “honest” than standard LLMs, their explanations were faithful in just 25% (Claude) and 39% (DeepSeek) of the cases. Interestingly, the researchers also observed that unfaithful explanations were typically longer than faithful ones, and that models were less likely to admit using a hint when the question was more difficult or when the hint was presumably obtained in an unethical way (e.g., when the prompt mentioned unauthorized system access). In other words: do use a reasoning model for complex tasks, but treat its “thinking” with suspicion.

The percentage of faithful explanations (referring to the hint) for 4 models and 6 types of hints. Source: Chen et al. (2025)
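To make that faithfulness number concrete, here is a small sketch of how such a score could be computed from logged model outputs. The field names are hypothetical and the actual evaluation in Chen et al. (2025) is more involved (it uses a model to judge whether a trace acknowledges the hint), but the metric boils down to this: of the cases where the hint swayed the answer, how many traces admit it?

```python
def faithfulness_rate(records):
    """Among cases where the hint changed the model's answer to the hinted one,
    return the fraction whose reasoning trace admits that it used the hint."""
    followed = [
        r for r in records
        if r["answer_without_hint"] != r["hinted_answer"]
        and r["answer_with_hint"] == r["hinted_answer"]
    ]
    if not followed:
        return None
    return sum(r["trace_mentions_hint"] for r in followed) / len(followed)

# Two hypothetical logged examples: the hint changed both answers,
# but only one reasoning trace mentions it.
records = [
    {"answer_without_hint": "B", "answer_with_hint": "A",
     "hinted_answer": "A", "trace_mentions_hint": False},
    {"answer_without_hint": "C", "answer_with_hint": "A",
     "hinted_answer": "A", "trace_mentions_hint": True},
]
print(faithfulness_rate(records))  # 0.5
```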
Most advanced chatbots now include reasoning models as well as standard LLMs. OpenAI’s reasoning models are easy to recognize because they all start with an o: o1, o3, o4-mini, o4-mini-high, etc. Claude’s Sonnet and Opus models also feature reasoning capabilities, as do Google’s Gemini Flash and Pro. Most providers also allow users to control the length of reasoning traces, to find the best balance between cost, quality and speed, although this is often only available to developers who integrate the models through their API.
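As an illustration, this is roughly what that control looks like when calling OpenAI’s API with its Python SDK. The `reasoning_effort` parameter applies to OpenAI’s reasoning models; other providers use different mechanisms (Anthropic, for instance, works with a “thinking” token budget), and parameter names change over time, so check the current documentation of whichever API you use.

```python
# A minimal sketch of controlling reasoning length through an API (pip install openai).
from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="o4-mini",            # one of OpenAI's reasoning models
    reasoning_effort="high",    # "low", "medium" or "high": more effort means longer "thinking"
    messages=[
        {"role": "user", "content": "How many times does the letter r appear in 'strawberry'?"}
    ],
)
print(response.choices[0].message.content)
```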
Open models #
Most popular LLMs are closed models. Chatbots like ChatGPT, Claude, Gemini and the like are easily accessible, but they keep their underlying language models private. This means we will never know exactly how they work, can only influence their behavior through prompting and need to share our data with their developers to use them. Open models, by contrast, put all their cards on the table. We can see their architecture, consult all their parameters, and even download them to our own computer to run them locally (provided it is powerful enough). Some teams even release the software that was used to train the model, making it truly open-source.
Some of the most well-known open models are the Llama family (developed by Meta, the company that owns Facebook), the Qwen family (by Alibaba), DeepSeek and some models by French company Mistral. Interestingly, Chinese companies play a remarkably active role in the development of open models. Open models don’t just allow these companies to grow their brand; they also help them build an ecosystem of developers who create applications on top of their models.

Software like Ollama allows you to chat with LLMs on your own computer.
Many open models are available for download from the Hugging Face Hub. You don’t even need to be a programmer to use them! Software like Ollama is simple to install on a home computer and allows you to chat with a variety of (small) open LLMs. Although these small models do not offer the performance of their larger competitors, you don’t need an internet connection to run them and your data never leaves your computer.
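And if you do want to call such a local model from your own scripts, Ollama also exposes a simple HTTP API on your machine. The sketch below assumes Ollama is running with its default settings and that you have already pulled the model it names; it uses only Python’s standard library.

```python
import json
import urllib.request

payload = {
    "model": "llama3.2",   # any model you have pulled, e.g. with `ollama pull llama3.2`
    "prompt": "Write a haiku about artificial intelligence.",
    "stream": False,       # ask for one complete answer instead of a token stream
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",   # Ollama's default local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```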