Next-token prediction (LLMs)
Large language models are fundamentally next-token (next-word) predictors: a sequence goes in, the model assigns probabilities over the vocabulary, and generation proceeds one token at a time—optionally sampling from the top few candidates for variety rather than always picking the argmax.
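A minimal sketch of that loop, assuming a toy vocabulary and a stand-in `model` that just returns random logits in place of a real trained network (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "cat", "sat", "on", "mat", "."]  # toy vocabulary

def model(context):
    """Stand-in for a trained LLM: returns one unnormalized score (logit)
    per vocabulary entry given the tokens so far. Here it's just random."""
    return rng.normal(size=len(VOCAB))

def sample_next(context, k=3, temperature=1.0):
    logits = model(context) / temperature
    top = np.argsort(logits)[-k:]               # keep only the top-k candidates
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()                        # softmax over the survivors
    return VOCAB[rng.choice(top, p=probs)]      # sample instead of taking argmax

context = ["the"]
for _ in range(5):
    context.append(sample_next(context))        # generation: one token at a time
print(" ".join(context))
```

Swapping the random logits for a real model's output is the only conceptual change; the sample-append-repeat loop is the same.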
- When input–output relationships are highly complex or high-dimensional, linear models fail; neural networks can, given enough capacity, approximate arbitrarily non-linear relationships (see the XOR sketch after this list).
- A natural-language lexicon is huge (~tens of thousands of “classes”); code vocabularies are smaller, so code LLMs can feel disproportionately capable at similar parameter scales.
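To make the first bullet concrete, XOR is the classic relationship no single linear layer can represent, while one hidden layer with a non-linearity can. A minimal sketch with hand-picked (not learned) weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# XOR inputs and targets: no single linear map w @ x + b reproduces this.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-picked weights for a 2-unit hidden layer (illustrative, not trained).
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

hidden = relu(X @ W1 + b1)   # the non-linearity is what makes this possible
pred = hidden @ W2           # prints [0. 1. 1. 0.], matching y exactly
print(pred)
```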