
Demystifying LLMs: The Core Technology Behind ChatGPT

A recent tweet from Andrej Karpathy made me realize that it's been a while since I last talked about Large Language Models (LLMs). Also, in recent workshops, I've noticed how important it is to dedicate more time to explaining what language models are, both large and small (LLMs and SLMs).


An LLM according to Midjourney

One of my principles about AI is that "to truly understand it, you first need to grasp how it works (and maybe try it out)." So, I want to simplify and explain what a language model is and why it is so relevant, especially now that we are talking about multimodality: a model's ability to process multiple types of input at once, such as text, images, and sounds.


What is a Language Model (LM)?

An LM (Language Model) is an artificial intelligence model that works with language. How does it work? If we provide the model with a sentence like "Today is a beautiful", it might complete it with the word "day" based on the context of the previous words. The model takes the words we provide and transforms them into tokens, small pieces of information (such as single words or parts of words). Based on this sequence of tokens, it tries to predict the following tokens using what it has learned.

Imagine you're telling a story, and you stop halfway through a sentence: a language model tries to predict the next word based on all the previous ones, just like we do.
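
To make this concrete, here is a minimal sketch of tokenization using OpenAI's open-source tiktoken library (my choice for the example; any tokenizer would illustrate the same idea):

```python
# pip install tiktoken  -- OpenAI's open-source tokenizer library
import tiktoken

# Load the token vocabulary used by GPT-4-class models
enc = tiktoken.get_encoding("cl100k_base")

text = "Today is a beautiful"
token_ids = enc.encode(text)

print(token_ids)                              # a short list of integer IDs, one per token
print([enc.decode([t]) for t in token_ids])   # roughly ['Today', ' is', ' a', ' beautiful']
```

The model never sees letters or words, only these integer IDs; everything it learns is about which IDs tend to follow which.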

Today's most well-known language models are built on the Transformer architecture, which uses an attention mechanism to better understand the context and the relationships between words. These models are autoregressive: they generate one token at a time, feeding each newly predicted token back into the input to produce the next.
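
As an illustration of that loop, here is a hedged sketch using the small, freely available GPT-2 model through the Hugging Face transformers library; GPT-2 and the greedy "always pick the most likely token" strategy are simplifications on my part, since production systems usually sample more cleverly:

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Start from our example prompt, encoded as token IDs
input_ids = tokenizer("Today is a beautiful", return_tensors="pt").input_ids

for _ in range(5):  # generate five tokens, one at a time
    with torch.no_grad():
        logits = model(input_ids).logits              # a score for every vocabulary token
    next_id = logits[:, -1, :].argmax(dim=-1)         # greedy: take the most likely next token
    input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)  # feed it back in

print(tokenizer.decode(input_ids[0]))
```

Note how the model is called once per generated token, each time on the whole sequence so far: that is what "autoregressive" means in practice.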

This process is not limited to words: if we can transform any problem into a sequence of tokens, the model can help us solve it. This forms the foundation of many generative AI applications.


What is an LLM?

If we use a huge amount of data (comparable to millions or billions of books) to train an LM, we get an LLM ("Large" Language Model). An LLM is a language model with an extremely high number of parameters, often tens or hundreds of billions, trained on a massive corpus of data that enables it to understand complex texts, summarize, translate, follow logical steps, and much more. Examples of LLMs include OpenAI's GPT-4, Google's Gemini 1.5, and Anthropic's Claude 3, among others.

These huge models require powerful data centers and advanced GPUs to operate. However, there are also smaller versions called SLMs (Small Language Models), reduced versions of LLMs. These models are lighter and can run on less powerful devices, such as computers or smartphones, but sacrifice some precision and knowledge. I discuss this again here: Why Personal AIs Will Change Everything.


Do LLMs Only Deal with Words?

No, there's much more to it. LLMs are not limited to processing text. If we can reduce our problem to a sequence of tokens, regardless of the type of data (text, numbers, symbols), we can assign it to an LLM. This is a fundamental concept that can change how we view the use of LLMs in our activities.


🔴 The Red Pill: For Those Who Want to Go Technical

In technical terms, a token is the smallest unit of meaning that a language model understands. It can be a word, part of a word, or a symbol (such as a number, a piece of code, or a chemical element). LLMs work through next token prediction: they read a sequence of tokens and try to predict the next one. If we can break a problem down into a sequence of tokens for the model to interpret, we can apply an LLM to it.
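
To see next token prediction itself, rather than full generation, this sketch (again with GPT-2 standing in for a much larger model) inspects the probabilities the model assigns to each candidate for the next token:

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Today is a beautiful", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # scores for the next token only

probs = torch.softmax(logits, dim=-1)         # turn raw scores into probabilities
top = torch.topk(probs, k=5)                  # the five most likely candidates
for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([tok_id.item()])!r}: {p.item():.3f}")
```

Everything the model "says" is ultimately drawn from a distribution like this one.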

We "train" the model to understand those tokens to do this. This means LLMs can work on any problem if we can translate it into a sequence of discrete tokens. This is where the idea of convergence of many issues into this paradigm comes from: the real challenge is no longer the type of problem but the vocabulary of the tokens used and their meaning in the specific context.


🔵 The Blue Pill: Non-Technical Version

If the technical part gave you a headache, simply imagine an LLM as a skilled translator trying to complete a sequence of words, symbols, or numbers, regardless of the content. If the problem can be described as a sequence accompanied by descriptive text, the LLM can help you solve it.


The Core: Predicting the Next Token

An LLM focuses on predicting the next token. However, the key lies in the meaning of the tokens: they can be words, numbers, chemical symbols, or other types of data. An LLM can be used for any problem that can be represented as a sequence of information, depending on the context and the knowledge it has learned.


So Why Is It Called a “Language” Model?

The name comes from the fact that these models were initially developed to work with natural language. Only later did we realize the same technology could be applied to other domains. This led to the birth of Multimodal Language Models: models trained on text, images, and sounds, capable of reasoning across multiple types of content simultaneously.

As Andrej Karpathy explained, a more suitable name for these models would be Autoregressive Transformers, but it is certainly less intuitive.


Is ChatGPT an LLM?

No! ChatGPT is the interface OpenAI developed to let you chat with an LLM like GPT-4 or GPT-4o mini (the small version). The interface also adds internet search results and many other features on top of the model. Similarly, Claude is the application that lets you use Anthropic's models, like Sonnet or Opus.


So What?

LLMs are not limited to generating text; they can model and relate any data flow to language. This means their use goes beyond simply creating content, becoming a tool for predicting, connecting, and automating complex processes based on various data types.

This technology allows us to interact with AI using different types of content, such as text, images, or voice, without complicated conversions. For example, you can upload a photo of an object and ask the AI to describe it, or provide a spoken description and get a written response. The increasing maturity of LLMs will show us that many problems can be solved simply by predicting the next token. Andrej concludes by asking what will happen to all the current deep learning frameworks, since everything seems to be converging on these technologies. Let me add, though, that we are still in an immature phase of generative AI and certainly far from the idea of Artificial General Intelligence (AGI).

What do you think? See you soon! Massimiliano
