Large Language Models #
In this first chapter, we explore how large language models work and how they are trained.
Chatbots and LLMs #
Before we get started, it’s important to make a distinction between chatbots on the one hand, and large language models (LLMs) on the other.
(Large) language models are mathematical models that predict the next word in a text. They take one or more words as input and output a possible continuation of that text. For example, an English language model that sees the words Yesterday I will likely assign a high probability to went, but a low probability to go. Large language models are called large because they perform millions or even billions of calculations to compute these probabilities. Below we’ll see where these calculations come from.
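To make this concrete, here is a toy sketch in Python of what it means to assign probabilities to the next word. The candidate words and their probabilities are invented for this example; a real LLM computes a probability for every word in its vocabulary.

```python
# Toy illustration (not a real model): a language model assigns a probability
# to every possible next word, given the words it has seen so far. The words
# and numbers below are invented for this example.
context = ("Yesterday", "I")

next_word_probs = {
    "went": 0.62,
    "was": 0.21,
    "had": 0.12,
    "go": 0.01,  # ungrammatical after "Yesterday I", so very unlikely
}

# The model typically outputs one of the most probable continuations.
best_word = max(next_word_probs, key=next_word_probs.get)
print(" ".join(context), best_word)  # Yesterday I went
```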
All of today’s chatbots — from ChatGPT to Claude, Gemini and Grok — have an LLM as their core component. In fact, many chatbots have several. ChatGPT has GPT-5.2-Instant and GPT-5.2-Thinking, Claude has models called Haiku, Sonnet and Opus, Gemini has a Flash and a Pro model, etc. Whenever you ask a question or give the chatbot an instruction, one of these models will reply. The choice will typically depend on the complexity of your task, whether you’re a paying customer, or other criteria.
There is more to most chatbots than an LLM, however. As we’ll see later, some chatbots have tools that allow them to search the web or execute programming code, a memory function that remembers previous conversations, and so on. Still, none of these components is as crucial as the LLM: without it, there would be no chatbot.

Today’s chatbots all have one or more LLMs at their core.
Since the release of ChatGPT in November 2022, a race has been underway to develop the best possible LLM. The LM Arena has a public leaderboard that shows how well current language models perform on a variety of tasks. At the moment of writing, Gemini-3-pro is the best overall model, followed by Gemini-3-flash and Grok-4.1-thinking. Claude’s Opus-4.5-thinking model excels at complex tasks like programming, while GPT-5.2-high scores best at math. The scores are obtained by having people enter a prompt and choose which of two (anonymous) LLMs gives the better response. Visit lmarena.ai if you’d like to try it out.

The LM Arena leaderboard ranks today’s best LLMs. (Source: LM Arena)
From Language to Numbers #
Despite their varying quality, today’s LLMs have more similarities than differences: they all use similar mathematical models, which are trained in a similar manner on similar data.
The core task of a large language model is to predict words. In that respect they resemble the predictive text function on your phone that tries to predict your next word when you type a message. My phone is horrible at this task, but you have to remember it is working with limited resources: it has far less memory and computing power at its disposal than ChatGPT or similar applications.

Large Language Models are neural networks that predict the next word in a sequence (Source: Transformer explainer).
Neural networks mimic the brain #
Large Language Models are neural networks — computer models that are inspired by how the human brain works. Of course, they’re not entirely similar. I always remember one of my professors at the University of Edinburgh, who said that neural networks are to brains like birds are to planes. They fly, but they don’t flap their wings.

Neural networks are computer models inspired by the brain.
Like our brains, neural networks are composed of neurons or cells that send signals to each other. These neurons are organized in layers. The neurons in the first layer analyze the input — the text, image, sound, etc. — and perform calculations on it. Based on the results of these calculations, they send a signal to the cells in the second layer. These in turn perform calculations on those signals, and send a signal to the neurons in the third layer. This process continues until it reaches the final layer. This layer decides the output: the next word in a text, the next pixel in a picture, the next note in a piece of music, etc.
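The minimal sketch below illustrates this layered structure. The layer sizes are arbitrary and the weights are random stand-ins; in a real network, the weights encode the calculations the model has learned during training.

```python
import numpy as np

# A minimal sketch of the layered structure described above. The layer sizes
# are arbitrary and the weights are random; in a real network, the weights
# encode the calculations the model has learned during training.
rng = np.random.default_rng(0)

def layer(signal, n_out):
    weights = rng.normal(size=(signal.shape[0], n_out))
    return np.maximum(0, signal @ weights)  # weighted sum plus a non-linearity

input_signal = rng.normal(size=8)   # a numerical representation of the input
hidden_1 = layer(input_signal, 16)  # first layer of "neurons"
hidden_2 = layer(hidden_1, 16)      # second layer works on the first's output
output = layer(hidden_2, 4)         # final layer determines the output
print(output)
```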
Maths with words #
This leads us to the question of how those neurons in the first layer perform calculations on their input. After all, words are not numbers. How does one do maths with words?
The way this works is one of the most crucial breakthroughs in Natural Language Processing of the last twenty years. LLMs map words to a series of numbers, which we call embeddings. It can be helpful to think of these embeddings as a series of coordinates on a map. Picture our globe, where every location has two coordinates, latitude and longitude. These two numbers tell you where to find a location and how close two locations are to each other. In the same way, the coordinates of two words tell you how similar their meanings are. Two synonyms, like bike and bicycle, will have very similar coordinates; two words with entirely different meanings, like coffee and equator, will have very different coordinates.

The embedding of ‘Belgium’ is close to that of other countries such as ‘Denmark’, ‘Netherlands’ and ‘Luxembourg’. (Source: Tensorflow Embedding Projector).
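As a toy illustration of the map analogy, the snippet below uses invented two-dimensional “embeddings” and measures how close two words are with cosine similarity. Real embeddings have hundreds or thousands of dimensions and are learned from data.

```python
import numpy as np

# Invented two-dimensional "embeddings", analogous to coordinates on a map.
# Real embeddings have far more dimensions and are learned from data; the
# numbers below are made up purely for illustration.
embeddings = {
    "bike":    np.array([0.90, 0.80]),
    "bicycle": np.array([0.88, 0.82]),
    "coffee":  np.array([-0.70, 0.30]),
    "equator": np.array([0.10, -0.90]),
}

def similarity(a, b):
    # Cosine similarity: close to 1 for similar meanings, much lower otherwise.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(similarity(embeddings["bike"], embeddings["bicycle"]))   # close to 1
print(similarity(embeddings["coffee"], embeddings["equator"])) # around -0.49
```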
An interesting characteristic of these coordinates is that they allow us to do maths with words. We can now literally add or subtract meanings. This often gives fascinating results. For example, if you take the coordinates for king, subtract those for man and add those for woman, you end up with coordinates that are very close to those of queen. Similarly, if you take Paris, subtract France and add Germany, you end up in the vicinity of Berlin. Cool, isn’t it?

Embeddings allow us to perform calculations with words.
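Here is the same idea as a sketch, again with invented two-dimensional vectors chosen so that the arithmetic works out; with real embeddings, the effect emerges from training on huge amounts of text.

```python
import numpy as np

# Invented two-dimensional vectors, chosen so that the arithmetic works out;
# with real embeddings the same effect emerges from training on huge corpora.
vectors = {
    "king":  np.array([0.95, 0.90]),   # roughly: [royalty, maleness]
    "queen": np.array([0.96, -0.88]),
    "man":   np.array([0.05, 0.92]),
    "woman": np.array([0.04, -0.90]),
}

result = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the known word whose vector lies closest to the result.
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - result))
print(result, closest)  # [0.94 -0.92] queen
```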
These calculations are similar to the ones that a Large Language Model makes. When it reads a text, it combines the coordinates of all its words and performs calculations on those coordinates. Each layer continues to work with the output of the previous layer, until we reach the last layer, which assigns a probability to every word in the vocabulary on the basis of this outcome. The LLM then outputs one of the most probable words.
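In code, that last step might look something like the sketch below: the final layer produces a score for every word in a tiny, invented vocabulary, and a softmax turns those scores into probabilities, from which the model picks its output.

```python
import numpy as np

# A sketch of the final step: the last layer produces a score ("logit") for
# every word in the vocabulary, and a softmax turns those scores into
# probabilities. The four-word vocabulary and the scores are invented here.
vocabulary = ["went", "go", "slept", "banana"]
scores = np.array([4.0, 0.5, 2.5, -1.0])

probabilities = np.exp(scores) / np.exp(scores).sum()
for word, p in zip(vocabulary, probabilities):
    print(f"{word}: {p:.3f}")

# The model then outputs one of the most probable words, for instance by
# sampling from this distribution.
next_word = np.random.default_rng(0).choice(vocabulary, p=probabilities)
print(next_word)
```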
Training an LLM #
This of course brings us to another question: how do LLMs know exactly which calculations they have to perform? Since they contain billions of neurons, it’s impossible for humans to tell the model what to do where exactly. And even if we had the time, we wouldn’t know, since all calculations have complex effects on the following ones. The amazing thing is that neural networks learn these precise calculations by themselves, in a process we call machine learning.
All Large Language Models today are trained in a similar manner. Their training consists of three phases. In the first, they learn to predict the next word in a text. In the second, they learn to respond to instructions. In the third, they are aligned with human expectations.
Phase 1: Word prediction #
To learn what calculations to make, LLMs analyze countless texts. At every step, they look at a part of the text — the first nine words, say — and then try to predict the next word. Initially, they’re horrible at this game: they haven’t seen many examples yet and don’t know what calculations to make, so they basically make a random guess. But after every guess they look at the correct word — after all, it’s there, they just haven’t looked at it yet. If they guessed incorrectly, they adjust their calculations in order to avoid making this mistake again. In this way they become better at this game, slowly but surely, as their calculations improve.
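The toy program below mimics this game on a miniature scale: a model that predicts the next word from just the previous word, trained on a one-sentence “corpus”. Everything about it is scaled down, but the loop is the same: predict, compare with the actual next word, adjust the calculations, repeat.

```python
import numpy as np

# A deliberately tiny version of phase 1. Real LLMs use much longer contexts
# and far bigger networks, but the training loop has the same shape.
corpus = "yesterday i went to the park and yesterday i went to the beach".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), len(vocab)))  # the model's "calculations"

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(200):
    for prev, nxt in zip(corpus, corpus[1:]):
        probs = softmax(W[idx[prev]])   # predicted distribution over next words
        target = np.zeros(len(vocab))
        target[idx[nxt]] = 1.0          # the word that actually comes next
        W[idx[prev]] -= 0.1 * (probs - target)  # adjust to reduce the mistake

print(vocab[softmax(W[idx["yesterday"]]).argmax()])  # -> "i"
```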
This learning process doesn’t happen overnight. Every LLM on the market has been optimized for months, while it crunched through virtually all of the texts on the world wide web, and then some.
Phase 2: Instruction tuning #
Even a language model that can perfectly guess every word in a text is only of limited use. When we ask ChatGPT a question, we don’t want it to guess the correct words following this question. For all we know, this could be another question, as the internet is full of tests and exams. No, we want it to give us the answer instead.

Base models like GPT-3 complete texts; instruction-trained models like InstructGPT follow instructions. (Source: OpenAI)
To ensure LLMs obey our instructions, they have undergone a second training phase. In this phase, they have seen many examples of instructions, together with a correct solution. These training instructions cover a wide range of applications, from factual questions — Joe Biden is the Nth president of the United States. What is N? — to more creative tasks — Write a poem about the sun and moon or Write a four-sentence horror story about sleeping.
In this second training phase, the model learns as it did in the previous one. After it has read the instruction, it tries to predict the words in the response one by one. Whenever it makes a mistake, it updates its calculations in order to improve.
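The sketch below shows how a single instruction-response pair (invented here) is turned into word-prediction exercises: the model reads the instruction plus the answer so far, and learns to predict the next word of the answer.

```python
# A sketch of how one instruction-response example (invented here) is turned
# into word-prediction exercises during instruction tuning. The mechanics are
# the same as in phase 1; only the training data changes.
example = {
    "instruction": "Write a one-sentence horror story about sleeping.",
    "response": "I woke up and heard breathing under my bed.",
}

instruction_words = example["instruction"].split()
response_words = example["response"].split()

for i, target in enumerate(response_words):
    context = instruction_words + response_words[:i]
    print(f"given: {' '.join(context)}")
    print(f"  -> predict: {target}")
```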
Phase 3: Alignment #
So far, our model has learnt to mimic the texts it has seen. But since it has read basically all of the world wide web, you can imagine it has seen many texts which we do not want it to mimic — content that is in some way sexist, racist, violent, discriminatory, etc. Therefore LLMs typically undergo a third training phase to align them with the preferences of their users. Here they are not just taught to avoid generating undesired content, but also to become more helpful.

Instruction-tuned models are further trained on human feedback, so that their output aligns with human preferences.
(Source: OpenAI)
This third training phase works differently from the previous ones. This time the model is made to output multiple responses for every instruction. Human labellers then rank those responses from most desirable to least desirable. Based on this human feedback, a reward model is trained that steers the model towards more preferred responses and away from unwanted ones.
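A common way to train such a reward model is with a pairwise loss on the scores of the preferred and the rejected response; the sketch below shows the idea with invented scores.

```python
import numpy as np

# A pairwise loss on the reward model's scores: the loss is small when the
# preferred response scores clearly higher than the rejected one, and large
# when the ranking is wrong. The scores below are invented.
def pairwise_loss(score_preferred, score_rejected):
    return -np.log(1 / (1 + np.exp(-(score_preferred - score_rejected))))

print(pairwise_loss(2.0, -1.0))  # ranking correct -> small loss (about 0.05)
print(pairwise_loss(-1.0, 2.0))  # ranking wrong   -> large loss (about 3.05)
```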
Test-time improvements #
During the first few years of LLM development, most attention focused on improving the training process. Developers collected more data and experimented with multiple ways of aligning models with human preferences. Recently, attention has shifted towards improving the performance of the models when they’re being used, without changing their internal calculations.
These test-time improvements can take different forms. On the one hand, some models are primed to answer in specific ways. When they get a complex instruction, like a difficult calculation, they work through the task step by step. Reasoning models, like OpenAI’s o3 or DeepSeek’s DeepThink, generate a long reasoning process before they formulate a final answer.
On the other hand, the most advanced models generate multiple (partial) answers. At each step, they evaluate these candidates and reject all but the most promising ones. Only at the end of the process do they decide on their final response. Like step-by-step response generation, this strategy improves the quality of the final answer.
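Schematically, this “generate several candidates and keep the best” strategy looks like the sketch below. The generate and score functions are hypothetical stand-ins for the LLM and its evaluator; in the most advanced models this filtering happens at every step of a partial answer rather than only at the end.

```python
import random

# generate() and score() are hypothetical stand-ins for an LLM producing
# several draft answers and an evaluator estimating their quality (often the
# model itself judging its own candidates).
def generate(prompt, n):
    return [f"candidate answer {i} to: {prompt}" for i in range(n)]

def score(candidate):
    return random.random()  # stand-in for a real quality estimate

candidates = generate("What is 123 times 321?", n=4)
best = max(candidates, key=score)
print(best)  # only the most promising candidate becomes the final response
```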
Limitations of LLMs #
LLMs are great at writing sensible and grammatically correct texts, in a wide variety of languages. Despite their rapid progress, however, they still have major limitations. First, LLMs are black boxes whose behavior is sometimes hard to explain. When we prompt them to justify their response, the explanation is anything but trustworthy. Second, because they are linguistic machines, vanilla LLMs can struggle with even fairly simple arithmetic tasks. Third, and most importantly, LLMs sometimes provide incorrect information, in what is somewhat euphemistically termed hallucinations.
Black boxes #
First of all, Large Language Models are black boxes. This means it’s often impossible to explain why they behave the way they do. Even when we know what calculations lead them to a specific response, it’s impossible to map these calculations to human concepts. It’s not like one cell in the neural network is responsible for the distinction between living things and objects, say, and another one determines whether the next word will be a noun or a verb. The calculations the neural network performs reflect the computer’s way of structuring language and the world, and they’re very different from the way humans do it.
When we’re using ChatGPT to help us write an email or LinkedIn post, this is generally not problematic. But it does have important consequences. One is that it is very difficult to control the output of an LLM. Despite the complex training process I described earlier, it’s very hard to ensure that an LLM will never generate an unwanted opinion, for example. Elon Musk experienced this when his AI company xAI built Grok, an alternative to the “woke” ChatGPT.
Another problem arises when we use generative AI for high-stakes decisions, such as diagnosing diseases and suggesting treatments, or determining whether someone should be eligible for insurance and how much they should pay. In such cases, a simple answer does not suffice: we also want to know why the model gave that answer. For the types of language models we’re discussing here, that’s a huge challenge.
Mathematics #
At a time when (presumably) LLM-based AI systems win gold medals at the International Math Olympiad, it may sound strange to say that maths is not their strongest suit. But it’s true: Large Language Models predict sequences of words, and certainly the LLMs that normal users interact with may struggle with mathematical tasks. When we present them with a prompt like what is 123 times 321, they do not actually perform this calculation. Instead they generate a series of words that is likely to give an answer to your instruction. For a simple instruction, like 1+1 or even 123 times 321, this does not matter very much. The answer will most likely be correct, because the LLM has seen enough similar examples in its training data or has been taught to break down the calculation into a series of much simpler steps. However, the more complex your instruction becomes, the less likely the response will be correct, as the following interaction with Claude illustrates:

Claude tells us the square root of 123x456x789 is 6,653.07, while the correct answer is 6,652.33
Yuntian Deng, Assistant Professor at the University of Waterloo, shared some interesting statistics on X. His experiments showed that most LLMs don’t have any problem multiplying small numbers, but as the numbers grow, the task becomes more difficult. For example, when OpenAI’s o3-mini is asked to multiply two 13-digit numbers, it only gets it right about 25% of the time. The accuracy keeps dropping as the numbers get longer, and bottoms out at 0% for two 20-digit numbers.

LLMs struggle to multiply large numbers (Source: Yuntian Deng on X)
To be clear, the performance of LLMs on mathematical tasks is still remarkably impressive, especially when you take into account that they just predict words. However, we have better tools for arithmetic than LLMs.
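For instance, the example from the Claude screenshot above takes only a few lines of Python to get exactly right:

```python
import math

# Reproducing the example from the screenshot above with exact arithmetic.
product = 123 * 456 * 789
print(product)             # 44253432
print(math.sqrt(product))  # about 6652.33, not the 6,653.07 that Claude gave
```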
🎓 Exercise
Try to trip up an LLM with arithmetic. Tip: focus on long sequences — either numbers with many digits or operations (addition, multiplication, etc.) with many elements.
Hallucinations #
Finally, and most worryingly, Large Language Models sometimes make claims that are incorrect. This, too, is because they have been trained to generate probable sequences of words, and do not ask themselves whether these words express a fact. These incorrect claims are often termed hallucinations, although some people prefer to call them bullshit.
Even the best large language models hallucinate. The system card of GPT-5 documents that when it is connected to the web, 9.6% of the claims that the standard model (gpt-5-main) makes are incorrect. When thinking is enabled (gpt-5-thinking), this number drops to 4.5%. This is a significant improvement over earlier models like GPT-4o (12.9% incorrect claims) and o3. Still, it means that around one in 20 of GPT-5’s claims do not correspond to the facts.

Factuality of a representative sample of ChatGPT production traffic
Source: GPT-5 system card by OpenAI
For example, when I asked GPT-5 to name 10 authors that were born in Edinburgh, Scotland, its list contained several errors. Ian Rankin’s Rebus crime novels are set in Edinburgh, but the author was born in Cardenden, a small village some 20 miles from Edinburgh. Kathleen Fidler spent most of her life in Scotland, but was actually born in Coalville, an English town. These errors occur despite the model having checked the world wide web and including sources in its response.

GPT-5 incorrectly includes Ian Rankin and Kathleen Fidler in a list of Edinburgh-born authors.
Source: conversation with ChatGPT
As we’ll see later, there are some tricks and strategies to reduce the number of hallucinations in the output, but there is no foolproof way to ensure that the LLM always answers correctly.