I’ve been thinking that understanding how large language models (LLMs, like ChatGPT) work explains how our (at least my) brain probably works, and vice versa: observing how the brain works can suggest better ways to train LLMs.
You know, LLMs are built on a simple idea: choosing an appropriate next word given the N previous ones, which form the “context”. To do this, LLMs are trained on a gigantic corpus of texts, which shows them what words typically follow others in various contexts.
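To make that concrete, here’s a minimal sketch of what “predicting the next word” (strictly, the next token) looks like in practice, using the Hugging Face transformers library and the small GPT-2 model; the prompt is just an illustration.

```python
# A minimal sketch: ask GPT-2 for its most likely continuations of a context.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "I poured myself a cup of"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits      # (1, seq_len, vocab_size)

# Probability distribution over the *next* token, given everything so far.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r:>12}  {prob.item():.3f}")
# Prints the model's top guesses for the next token, e.g. ' coffee', ' tea'.
```

That’s the whole trick: everything the model “knows” is packed into how sharply that probability distribution concentrates on continuations that make sense in context.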
So when you study any language, English for example, this stage is inevitable. You need to encounter a stream of words in some form, written or spoken, so that your brain can discover and internalize patterns simply by reading or listening (and better yet both, i.e. multimodality).
In LLMs the basic units are not words but tokens: whole words and, quite often, parts of words. While processing that vast corpus of text, it turned out to be straightforward to simply find the most frequent character sequences, and these naturally came out as whole words in some cases and word fragments in others. Similarly, when you start speaking a foreign language, especially one with a system of endings, you begin to pronounce the beginning of a word while your brain is still frantically “computing” the ending.
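A small illustration of this (a sketch using the GPT-2 tokenizer from the same transformers library; the example words are mine):

```python
# Show how a subword (BPE) tokenizer splits words into pieces.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

for word in ["cat", "cats", "unbelievable", "unbelievably"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word:>14} -> {pieces}")
# Frequent words stay whole; rarer or inflected forms get split into a stem-like
# piece plus ending-like pieces. The exact splits depend on the learned merges.
```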
When we read or listen, we don’t actually analyze words letter by letter; important pieces often simply go missing because of fast or slurred speech, or typos. But the brain doesn’t need to sift through every word that looks or sounds like the one in front of it. It only needs to check whether what was heard or seen matches a very limited set of words that could logically follow the previous ones.
Whole set phrases are a separate story. In our brain they form a single “token”: they aren’t broken down into individual words unless you deliberately think about it. And such tokens don’t appear in the stream by accident either; the brain expects them, and as soon as it hears or sees signs that such a phrase has begun, the set of options narrows to literally one or two phrases with that beginning, and that’s it: one of them is what was said or written.
But the most interesting thing is that recent research has shown the human brain really does work very similarly to LLMs. In the study “The neural architecture of language: Integrative modeling converges on predictive processing”, MIT scientists showed that models that predict the next word better also model brain activity during language processing more accurately. So the mechanism used in modern neural networks is not merely inspired by cognitive processes; it actually reflects them.
In the experiment, fMRI and electrocorticography (ECoG) data recorded during language perception were analyzed. The researchers found that the best predictive model at the time (GPT-2 XL) could explain almost 100% of the explainable variance in neural responses. This means that language comprehension in humans really is built on predictive processing rather than on sequential analysis of words and grammatical structures. Moreover, the task of predicting the next word turned out to be the key one: models trained on other language tasks (for example, grammatical parsing) were worse at predicting brain activity.
If this is true, then the key to fluent reading and speaking in a foreign language is precisely training that predictive processing. The more the brain encounters a stream of natural language (both written and spoken), the better it gets at forming expectations about the next word or phrase. This also explains why native speakers don’t notice some grammatical errors and can’t always explain the rules: their brain isn’t analyzing individual elements, it’s predicting whole speech patterns.
So if you want to speak fluently, you don’t just need to learn the rules; you need to literally immerse your brain in the flow of the language: listen, read, speak, so that the neural network in your head gets trained to predict words and structures just as GPT does.
Meanwhile, there’s the theory of predictive coding, which asserts that, unlike language models that predict only the nearest words, the human brain forms predictions at several levels and time scales at once. Other researchers have tested this (google “Evidence of a predictive coding hierarchy in the human brain listening to speech”).
Briefly: the brain doesn’t just predict the next word; it’s as if several prediction processes of different “resolutions” were running in parallel. The temporal cortex (the lower level) predicts short-range, local elements (sounds, words), while the frontal and parietal cortex (the higher level) predicts long-range, global language structures. Semantic predictions (the meaning of words and phrases) cover longer spans (≈8 words ahead), whereas syntactic predictions (grammatical structure) have a shorter horizon (≈5 words ahead).
If you try to transfer this concept to the architecture of language models, you might be able to improve them with a hierarchical predictive system. Currently, models like GPT operate within a fixed context window: they analyze a limited number of previous tokens and predict the next one, never looking beyond those boundaries. In the brain, however, predictions work at different levels: locally, at the level of words and sentences, and globally, at the level of entire semantic blocks.
One of the possible ways to improve LLMs is to add a mechanism that simultaneously works with different time horizons.
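To make the idea concrete, here’s a toy sketch of my own (nothing from the papers above; all the names and sizes are made up): attach several prediction heads to the same hidden states, each looking a different number of tokens ahead, and sum their losses. As far as I know, this is close in spirit to what some recent work calls multi-token prediction.

```python
# A toy multi-horizon objective: besides the next token, also predict the tokens
# 2 and 4 positions ahead from the same hidden state. Purely illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHorizonHead(nn.Module):
    def __init__(self, d_model, vocab_size, horizons=(1, 2, 4)):
        super().__init__()
        self.horizons = horizons
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in horizons]
        )

    def forward(self, hidden, targets):
        # hidden:  (batch, seq_len, d_model) from any causal LM backbone
        # targets: (batch, seq_len) token ids
        total = 0.0
        for head, h in zip(self.heads, self.horizons):
            logits = head(hidden[:, :-h])   # predict h steps ahead
            labels = targets[:, h:]         # targets shifted by h
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
            )
        return total / len(self.horizons)

# Usage with random tensors standing in for a real model's hidden states:
d_model, vocab = 64, 1000
head = MultiHorizonHead(d_model, vocab)
hidden = torch.randn(2, 16, d_model)
targets = torch.randint(0, vocab, (2, 16))
loss = head(hidden, targets)
loss.backward()
```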
Interestingly, could you set up an LLM so that some layers specialize in short-range dependencies (e.g., adjacent words) and others in longer structures (e.g., the semantic content of a paragraph)? I googled it, and there is something similar under the heading of “hierarchical transformers”, where layers interact at different levels of abstraction, but that line of work seems to be mostly about processing very long documents.
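Just to make the question concrete, here’s a sketch of my own (the window size and the layer split are arbitrary, and actual hierarchical-transformer papers do things differently): give the lower layers a narrow sliding-window attention mask and let the upper layers attend to the whole context.

```python
# Build causal attention masks: a narrow sliding window for "short-range" layers,
# full causal attention for "long-range" layers. Illustrative only.
import torch

def causal_mask(seq_len, window=None):
    """Boolean mask where True means the key position may be attended to."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    mask = j <= i                            # causal: no looking ahead
    if window is not None:
        mask &= (i - j) < window             # local: only the last `window` tokens
    return mask

seq_len = 12
local_mask = causal_mask(seq_len, window=3)  # for lower layers: adjacent words only
global_mask = causal_mask(seq_len)           # for upper layers: the whole context

# In a transformer you would route ~local_mask to, say, layers 0-5 and
# ~global_mask to layers 6-11 (PyTorch's nn.MultiheadAttention takes attn_mask
# with the inverted convention: True means "not allowed to attend").
print(local_mask.int())
print(global_mask.int())
```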
As I understand it, the problem is that for this you’d need to train foundation models from scratch, and it probably doesn’t work well on unlabelled or poorly labelled data.
Another option is multitask learning, so that the model not only predicts the next word but also tries to guess what the next sentence, or even the whole paragraph, will be about. Again, a quick search suggests this can be implemented, for example, by dividing the attention heads in the transformer, so that some parts of the model handle short-range language dependencies while others predict longer-term semantic connections (a rough sketch of the multitask part is below). But as soon as I dive deeper into this topic, my brain explodes. It’s all genuinely complex.
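For what it’s worth, here is the simplest form the multitask part might take (again my own sketch; the “next-sentence summary” target and the frozen sentence encoder it implies are assumptions, not a recipe from any paper): the usual next-token loss plus an auxiliary head that tries to predict a summary vector of the upcoming sentence.

```python
# Toy multitask objective: next-token prediction plus predicting a (precomputed)
# embedding of the upcoming sentence from the current hidden state. Illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskLoss(nn.Module):
    def __init__(self, d_model, vocab_size, summary_dim):
        super().__init__()
        self.lm_head = nn.Linear(d_model, vocab_size)         # word-level horizon
        self.summary_head = nn.Linear(d_model, summary_dim)   # sentence/paragraph horizon

    def forward(self, hidden, next_tokens, next_sentence_emb, alpha=0.1):
        # hidden:            (batch, seq_len, d_model) from a causal LM backbone
        # next_tokens:       (batch, seq_len) shifted target token ids
        # next_sentence_emb: (batch, summary_dim) embedding of the *following*
        #                    sentence, e.g. from a frozen sentence encoder (assumption)
        lm_logits = self.lm_head(hidden)
        lm_loss = F.cross_entropy(
            lm_logits.reshape(-1, lm_logits.size(-1)), next_tokens.reshape(-1)
        )
        # Use the last position's state to guess what comes next at a coarser scale.
        summary_pred = self.summary_head(hidden[:, -1])
        aux_loss = 1 - F.cosine_similarity(summary_pred, next_sentence_emb).mean()
        return lm_loss + alpha * aux_loss

# Usage with dummy tensors:
d_model, vocab, summary_dim = 64, 1000, 32
crit = MultiTaskLoss(d_model, vocab, summary_dim)
hidden = torch.randn(2, 16, d_model)
next_tokens = torch.randint(0, vocab, (2, 16))
next_sentence_emb = torch.randn(2, summary_dim)
loss = crit(hidden, next_tokens, next_sentence_emb)
loss.backward()
```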
But perhaps, if such a multilevel prediction system could be integrated into LLMs, they would understand context better and generate more meaningful and coherent text, getting closer to how the human brain works.
I’ll be at a conference on the subject in March; will need to talk with the scientists then.

