How Large Language Models Actually Work

A large language model has never seen the world. It has never held a cat, watched a sunset, or run a single experiment. It has only read. And yet ask one to explain quantum tunnelling, write a sonnet about regret, or fix your code, and it will attempt all three in seconds. The real puzzle of modern AI is how a machine that learned to do one narrow thing, predict the next word, ended up looking so much like it understands. The honest answer is stranger, and simpler, than most people expect.

It is a guessing game

Underneath the chat window, a language model does exactly one thing. It predicts the next token, a token being a chunk of text roughly the size of a short word. You hand it a sequence and it returns a probability for every token in its vocabulary. The phrase "the cat sat on the" makes "mat" very likely and "photosynthesis" almost impossible. One token is drawn from those probabilities and appended, the longer sequence is fed back in, and the process repeats. That loop, one token at a time, is the entire performance.
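The loop is short enough to sketch in a few lines. What follows is a toy, not a real model: the lookup table TOY_NEXT_TOKEN_PROBS and the function names are invented for illustration, and the "model" here only looks at the last token, where a real one conditions on the whole sequence.

```python
import random

# Hypothetical toy "model": a hand-written table mapping the last token to a
# probability distribution over the next token. A real model learns these
# numbers in training and conditions on the entire sequence.
TOY_NEXT_TOKEN_PROBS = {
    "the": {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}


def next_token_distribution(tokens):
    """Return {token: probability} for the next token (toy: last token only)."""
    return TOY_NEXT_TOKEN_PROBS[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Predict, sample, append, repeat, until <end> or the token budget runs out."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist)
        weights = [dist[t] for t in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == "<end>":
            break
        tokens.append(token)  # the longer sequence is fed back in
    return tokens


print(" ".join(generate(["the", "cat", "sat", "on", "the"])))
```

Swap the table for a neural network and the shape of the loop stays the same.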

The same loop, sped up. Because generation runs strictly one token at a time, it is slow. The previous issue, on speculative decoding, was all about cheating that loop with a small model that guesses ahead. Everything in this issue is the machine doing the guessing.

The model does not choose a word. It produces a full probability distribution over every possible next token, then one is sampled.
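A rough sketch of that last step, assuming the model has already produced one raw score (a "logit") per token. The four-word vocabulary and the scores are made up. The temperature parameter is a common sampling control that this article does not otherwise cover, included only to show how the same scores can yield a sharper or flatter distribution.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up vocabulary and scores for the prompt "the cat sat on the".
vocab = ["mat", "sofa", "roof", "photosynthesis"]
logits = [4.0, 2.5, 2.0, -6.0]

probs = softmax(logits)
sampled = random.Random(0).choices(vocab, weights=probs, k=1)[0]

for token, p in zip(vocab, probs):
    print(f"{token:>15}  {p:.5f}")
print("sampled:", sampled)
```

Every token keeps some probability, however small, which is why the same prompt can produce different replies on different runs.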

Words become numbers

A model cannot do arithmetic on letters, so the first thing it does is turn every token into an embedding: a long list of numbers, a single point in a space with thousands of dimensions. Training slowly nudges these points around so that tokens used in similar ways end up near each other. "King" and "queen" become neighbours. "Paris" sits to "France" roughly as "Tokyo" sits to "Japan". Meaning, in a language model, is geometry.

Figure: a slice of the embedding space. Words with similar roles cluster, and consistent relationships (man to king, woman to queen) become parallel directions.

This is the quiet trick that makes everything else possible. Once words are points in space, the messy, ambiguous business of language becomes something a machine can measure, compare, and add up.
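To make "measure, compare, and add up" concrete, here is a minimal sketch with hand-made three-dimensional vectors. The numbers are invented so the arithmetic comes out cleanly. Real embeddings are learned, have far more dimensions, and only approximately show this kind of pattern.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not learned values).
# The axes are loosely: royalty, male, female.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the word whose embedding is most similar to the given vector."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return max(candidates, key=lambda w: cosine(vector, EMBEDDINGS[w]))


# Measure and compare: "king" is closer to "queen" than to "apple".
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]))
print(cosine(EMBEDDINGS["king"], EMBEDDINGS["apple"]))

# Add up: king - man + woman lands nearest to queen in this toy space.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(analogy, exclude=("king", "man", "woman")))
```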

Attention: reading in context

The breakthrough that created today's models arrived in 2017 with a design called the transformer. Older recurrent models read one token at a time, left to right, and tended to forget the start of a long sentence by the time they reached the end. The transformer is built around attention: a mechanism that lets every token look directly at the other tokens in the sequence and decide which ones matter right now. (In chat-style models that generate text, each token looks only at the tokens before it.)

Consider the word "bank". In "money in the bank" and "the bank of the river" it means two different things, and attention is how the model works out which. Each layer of attention lets words trade information and sharpen their meaning in context. Stack dozens of these layers and the model builds representations rich enough to track plot, code structure, and argument across thousands of words.
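The "bank" example can be sketched with a stripped-down version of scaled dot-product attention, the core operation from the 2017 paper. This is a simplification under stated assumptions: the two-dimensional vectors are invented (first axis roughly "finance", second roughly "water"), and queries, keys, and values are all the raw token vectors. A real transformer uses learned projections, many heads, and many layers.

```python
import math


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys, and values are the input vectors themselves.
    """
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score the query against every token, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted mix of all the token vectors.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(d)
        ]
        outputs.append(mixed)
    return outputs


# Invented toy vectors: axis 0 is roughly "finance", axis 1 roughly "water".
TOY = {"money": [2.0, 0.0], "river": [0.0, 2.0], "bank": [1.0, 1.0]}

bank_near_money = self_attention([TOY["money"], TOY["bank"]])[1]
bank_near_river = self_attention([TOY["river"], TOY["bank"]])[1]

print("bank next to money:", bank_near_money)
print("bank next to river:", bank_near_river)
```

The same ambiguous starting vector for "bank" comes out leaning toward finance in one context and toward water in the other, which is the sharpening described above in miniature.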

A language model does not store facts the way a database does. It stores the shape of language, and reconstructs the facts on demand.

Training: prediction at planetary scale

A model learns in two stages. The first is pretraining. You show it an enormous slice of the internet, books, and code, and you make it play the next-token game over and over, across hundreds of billions or even trillions of tokens. After each guess, you adjust its internal numbers, the parameters, a tiny amount in the direction that would have made the right token more likely. No human labels are needed, because the text is its own answer key: the next word is always sitting right there.
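One such adjustment can be written out by hand. The sketch below assumes the simplest possible "model": three adjustable scores, one per token, for a single fixed context. The vocabulary, learning rate, and step count are arbitrary choices for illustration. Real training applies the same idea (cross-entropy loss, a gradient step) to billions of parameters at once.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on the cross-entropy loss for one example."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])  # small when the right token is likely
    # For softmax plus cross-entropy, the gradient with respect to each score
    # is (predicted probability) minus (1 for the right token, 0 otherwise).
    grads = [
        p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)
    ]
    new_logits = [x - learning_rate * g for x, g in zip(logits, grads)]
    return new_logits, loss


vocab = ["mat", "sofa", "photosynthesis"]
logits = [0.0, 0.0, 0.0]  # untrained: every token equally likely
target = vocab.index("mat")  # the text itself supplies the answer

losses = []
for _ in range(20):
    logits, loss = training_step(logits, target)
    losses.append(loss)

print("first loss:", losses[0])
print("last loss:", losses[-1])
print("P(mat) after training:", softmax(logits)[target])
```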

Something remarkable falls out of this dull, repetitive game. To predict the next word well across trillions of words, the model is pushed to absorb grammar, facts, writing styles, and even rough reasoning, none of it programmed in by hand. GPT-3, the model that made this obvious to the world, had 175 billion of those parameters. The largest models since are bigger still.
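The 175 billion figure can be roughly reproduced with back-of-envelope arithmetic. The sketch below uses the layer count, model width, and vocabulary size as I understand them from the GPT-3 paper (96 layers, width 12,288, about 50,000 tokens), together with a widely used rule of thumb of about 12 × width² parameters per transformer layer. Treat it as an approximation: it ignores biases, normalisation layers, and position embeddings.

```python
# Assumed configuration of the largest GPT-3 model (Brown et al., 2020).
n_layers = 96
d_model = 12288
vocab_size = 50257

# Rule of thumb: about 4 * d^2 weights in attention plus about 8 * d^2 in the
# feed-forward block, so roughly 12 * d^2 per layer.
params_per_layer = 12 * d_model**2
transformer_params = n_layers * params_per_layer
embedding_params = vocab_size * d_model  # one vector per token
total_params = transformer_params + embedding_params

# Storage at 2 bytes per parameter (16-bit floats).
fp16_gigabytes = total_params * 2 / 1e9

print(f"approx. parameters: {total_params / 1e9:.1f} billion")
print(f"approx. size at 16 bits each: {fp16_gigabytes:.0f} GB")
```

The estimate lands within about one percent of the quoted 175 billion, and it makes the scale tangible: just storing the weights takes hundreds of gigabytes.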

Raw pretraining gives you a brilliant autocomplete, not an assistant. The second stage, alignment, fixes that. A smaller, carefully curated round of training (instruction tuning, then learning from human preferences) teaches the model to follow a request, stay helpful, and refuse the things it should. This is the polish that turns a text predictor into something you can talk to.

What it is not

Three common misreadings are worth clearing away. A language model is not a database that looks up answers. Facts live in it as statistical tendencies smeared across billions of weights, not as records on a shelf, which is exactly why it can be fluent and wrong in the same breath. It is also not browsing the internet as it replies, unless someone has bolted a tool on to let it. By default it works only from what it absorbed in training, frozen at some cut-off date.

And it is not thinking in the background between your messages. When the model is not actively producing a token, nothing is happening at all. Each reply is reconstructed from scratch, which is why the context you give it in the prompt, and the words it has generated so far, are the only things it can lean on.
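One way to picture this is a chat wrapper that re-sends the whole transcript on every turn. Everything below is hypothetical: toy_model stands in for a real model call, and ChatSession is an invented name, not any vendor's API. The point is only that the "memory" lives in the text passed in, not inside the model.

```python
def toy_model(context):
    """Hypothetical stand-in for a model call: a pure function of its input."""
    return f"(reply based on {len(context.split())} words of context)"


class ChatSession:
    """Keeps the transcript outside the model and re-sends it every turn."""

    def __init__(self, model):
        self.model = model
        self.transcript = []

    def send(self, user_message):
        self.transcript.append(f"User: {user_message}")
        # The model sees the full conversation so far, every single time.
        context = "\n".join(self.transcript)
        reply = self.model(context)
        self.transcript.append(f"Assistant: {reply}")
        return reply


session = ChatSession(toy_model)
print(session.send("hello there"))
print(session.send("and again"))
```

Drop the transcript and the second reply would know nothing about the first, because the model function itself carries nothing over from one call to the next.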

So does it understand?

Here is the uncomfortable middle. The model has no senses, keeps no memory between chats by default, and holds no built-in notion of truth. It is producing plausible text, which is exactly why it can state a falsehood with the same easy confidence as a fact. That failure even has a name, hallucination, and it is a direct consequence of how the machine works.

And yet predicting the next word well enough, often enough, seems to require building internal models of grammar, of the world, and of what the writer is trying to do. Whether that adds up to understanding is a genuine and unsettled debate. What is no longer in doubt is that "just predicting the next word" is doing far more work than the phrase suggests.

Why it matters

Once you see the model as a next-token predictor with a vast, learned sense of how language fits together, its behaviour stops being magic. You understand why it is fluent yet sometimes wrong, why rephrasing a prompt changes the answer, and why feeding it the right context is so powerful. Almost everything else in AI, from speculative decoding to retrieval to agents, is built on this one foundation.

Key terms

Token: A chunk of text, often a word-piece, that the model reads and predicts one unit at a time.
Embedding: A vector that turns a token into a point in space, where nearby points mean related things.
Attention: The transformer mechanism that lets each token weigh every other token to settle its meaning in context.
Parameters (weights): The billions of adjustable numbers that training tunes; together they hold what the model knows.

Sources

Vaswani et al. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
Brown et al. "Language Models are Few-Shot Learners" (GPT-3). 2020. arXiv:2005.14165
Jay Alammar. "The Illustrated Transformer." jalammar.github.io
Andrej Karpathy. "Let's build GPT: from scratch, in code, spelled out." YouTube
3Blue1Brown. "But what is a GPT? Visual intro to transformers." YouTube