Decoding Large Language Models and How They Work

The article reviews how Large Language and Multimodal Models process text and images using tokenization, embeddings, and architectures like CNNs and ViTs.

The evolution of natural language processing with Large Language Models (LLMs) like ChatGPT and GPT-4 marks a significant milestone, with these models demonstrating near-human comprehension in text-based tasks. Moving beyond this, OpenAI’s introduction of Large Multimodal Models (LMMs) represents a notable shift, enabling these models to process both images and textual data. This article will focus on the core text interpretation techniques of LLMs — tokenization and embedding — and their adaptation in multimodal contexts, signaling an AI future that transcends text to encompass a broader range of sensory inputs.

Turning raw text into a format that Large Language Models like GPT-4 can understand involves a series of complex and connected steps. Each of these processes — tokenization, token embeddings, and the use of transformer architecture — plays a critical role in how these models understand and generate human language.

Tokenization: The Gateway To Understanding Text

The first step in processing language data for LLMs is tokenization. It is the process where text is segmented into smaller units, known as tokens. This step is critical in breaking down complex language structures into elements that a machine-learning model can process and understand.

Different Approaches to Tokenization

Tokenization can take various forms, each with its unique advantages and challenges.

Character-level tokenization is the simplest form, where text is divided into individual characters. Its straightforward nature, however, leads to longer sequences of tokens, which can be less efficient for processing. On the other end of the spectrum is word-level tokenization, which splits text into words. This method is intuitive but can struggle with large vocabularies and often falters when encountering new or unknown words.
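To make the trade-off concrete, here is a minimal sketch comparing the two approaches on a made-up sentence. The sample text and the unseen word "tokenizer" are assumptions chosen purely for illustration.

```python
# Toy comparison of character-level and word-level tokenization.
text = "Tokenization splits text into smaller units"

char_tokens = list(text)      # every character, including spaces, becomes a token
word_tokens = text.split()    # whitespace-delimited words become tokens

print(len(char_tokens), "character tokens")
print(len(word_tokens), "word tokens")

# A word-level vocabulary built from this text cannot represent unseen words.
word_vocab = set(word_tokens)
new_word = "tokenizer"
print(new_word in word_vocab)  # False: the word is out of vocabulary

# A character-level vocabulary covers any word spelled with known characters.
char_vocab = set(char_tokens)
print(all(ch in char_vocab for ch in new_word))
```

The character-level sequence is several times longer than the word-level one, while the word-level vocabulary has no entry for the new word. Subword methods aim to sit between these two extremes.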

A more balanced approach is found in subword tokenization, exemplified by Byte Pair Encoding (BPE). BPE begins with a foundational vocabulary consisting of all unique characters in a training corpus, ensuring every word can be broken down into these basic units. The core of BPE lies in its method of frequency analysis and pair merging. It iteratively scans the corpus to identify the most frequently occurring pairs of characters or tokens. These pairs are then merged to form new tokens. This step is crucial as it enables the model to recognize and consolidate common pairings, treating them as single units in subsequent processing. The process continues, with each iteration merging the next most frequent pair.

As the vocabulary evolves with each iteration, new tokens (merged pairs) are added. The process repeats until the vocabulary reaches a predefined size or the desired level of granularity. The final vocabulary size strikes a balance between being detailed enough to represent complex words and manageable enough to be processed efficiently.

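When tokenizing new text, the algorithm employs this refined vocabulary. In the simplified view used in this article, it starts by searching for the longest matching tokens in the vocabulary. If no match is found, it breaks the text down into smaller units, continuing until it finds suitable matches. Production BPE tokenizers typically replay the learned merge rules in the order they were learned rather than matching greedily, but the effect is similar: common words and phrases are tokenized into fewer, larger tokens, which shortens the sequences the model has to process, while less common words are broken down further so that they can still be represented.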

BPE’s efficiency in representing common phrases, its capability in handling rare and unknown words through subword units, and its balanced approach make it particularly effective for processing languages with large vocabularies or intricate word formations. This method exemplifies a sophisticated approach to preparing data for processing by LLMs, contributing significantly to their understanding of human language.

I’ll walk through an example of how BPE works, along with some code after, to make this concept more concrete.

Let’s use the sentence “This is an example” as an example. Initially, BPE breaks down the sentence into its basic units. Rather than whole words, it starts with characters or small groups of characters. Thus, “This is an example” would first be tokenized into each individual character like ‘T,’ ‘h,’ ‘i,’ ‘s,’ and so on.

BPE then analyzes the frequency of adjacent symbol pairs in the training data and begins merging the most common ones. If, for instance, ‘Th,’ ‘is,’ ‘an,’ ‘ex,’ and ‘ample’ are frequent sequences in the training data, BPE will merge each of them into a single token, with longer tokens such as ‘ample’ built up over several successive merges. This process is iteratively applied, resulting in the sentence being represented by fewer, larger tokens such as ‘Th,’ ‘is,’ ‘an,’ ‘ex,’ and ‘ample.’

This transformation of the sentence from a sequence of individual characters to larger, more meaningful units reduces the total number of tokens needed to represent the text, maintaining efficiency without losing meaning. Additionally, BPE’s methodology is adept at handling new or rare words. For a word not seen during training, like “exemplary,” BPE can break it down into known subwords or characters, such as ‘ex,’ ‘em,’ ‘pl,’ ‘ar,’ and ‘y.’ This capability makes BPE particularly effective in dealing with unknown words, because any such word can still be spelled out from its existing vocabulary of subwords.

I will also show you what a simplified representation of BPE would look like in code.

Segment 1: Initializing Vocabulary From a Corpus

This segment prepares the initial vocabulary from a sample text corpus. Each word is split into characters, with a special end-of-word token </w> appended. This token marks word boundaries, so a subword at the end of a word (such as the “is” that ends “This”) stays distinct from the same characters appearing in the middle of another word.

corpus = "This is an example. This is another example."
vocab = {}

# Preprocessing and initializing vocabulary
for word in corpus.split():
    word = ' '.join(list(word)) + ' </w>'
    if word in vocab:
        vocab[word] += 1
    else:
        vocab[word] = 1


  • The corpus string contains the sample text.
  • The vocab dictionary will hold the initial vocabulary.
  • The loop splits each word into characters and adds an end-of-word token.
  • Each unique word (now a sequence of characters) is added to vocab with its frequency.

Segment 2: Defining Helper Functions

These functions are critical for BPE operations. They find the common pairs in the training data and merge them, as we discussed in the example above.


import re, collections

def get_stats(vocab):
    # Count how often each pair of adjacent symbols occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every standalone occurrence of the pair with the merged symbol
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out


  • get_stats: Calculates the frequency of each pair of adjacent tokens in the vocabulary.
  • merge_vocab: Merges the most frequent pair into a single token. It uses regular expressions to find and merge these pairs in the vocabulary.

Segment 3: Iterative Merging of Pairs

This segment performs the core of BPE: iteratively merging the most frequent pairs of characters in the vocabulary.


# Number of merges
num_merges = 10
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break  # Nothing left to merge
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
print("Vocabulary after BPE:")
print(vocab)


  • The loop runs for a specified number of iterations (num_merges).
  • In each iteration, it identifies and merges the most frequent pair.
  • After all iterations, the updated vocabulary reflects the results of BPE.

Segment 4: Tokenizing New Text

This final segment demonstrates how to use the BPE vocabulary to tokenize a new sentence.

def bpe_tokenize(text, vocab):
    # Collect the subword symbols that appear in the merged vocabulary
    symbols = {symbol for entry in vocab for symbol in entry.split()}
    symbols.add('</w>')
    tokens = []
    for word in text.split():
        word = word + '</w>'
        while word:
            # Greedy longest match; fall back to a single character if nothing matches
            candidates = [s for s in symbols if word.startswith(s)]
            longest_token = max(candidates, key=len) if candidates else word[0]
            tokens.append(longest_token)
            word = word[len(longest_token):]
    return tokens

new_sentence = "This example is different."
tokenized_sentence = bpe_tokenize(new_sentence, vocab)
print("\nTokenized sentence:")
print(tokenized_sentence)


  • The bpe_tokenize function tokenizes a new sentence using the symbols learned in the BPE vocabulary.
  • It splits the sentence into words, appends the end-of-word token to each, and repeatedly takes the longest vocabulary token that matches the start of the remaining text, falling back to a single character when nothing matches.
  • The function outputs the tokenized form of the new sentence.

Together, these code segments demonstrate the creation of a BPE vocabulary from a corpus and its application in tokenizing new text.

Tokenization is more than just a preliminary step; it lays the groundwork for deeper language comprehension. By converting text into tokens, we prepare the data for the subsequent phase of embedding, where the true semantic processing begins.

Transitioning to Meaning: The Significance of Token Embeddings

Once tokenized, the next step for the text is token embeddings, a process that captures the semantic meanings of words based on context. The process involves initializing embeddings, training the model to adjust these embeddings, and ensuring that semantically similar tokens are close in the embedding space.

Embedding Initialization

Each token (word or subword) in the model’s vocabulary is associated with an embedding, which is a high-dimensional vector. These embeddings are typically initialized randomly. This randomness provides a starting point for the training process to adjust and refine these vectors.
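As a rough sketch of what this looks like in practice, the embeddings can be stored as one matrix with a row per token, so that looking up a token’s vector is simply row indexing. The toy vocabulary, the dimension of 8, and the initialization scale of 0.02 below are assumptions chosen for readability, not values taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy vocabulary mapping tokens to integer IDs.
token_to_id = {"Th": 0, "is": 1, "an": 2, "ex": 3, "ample": 4}
vocab_size = len(token_to_id)
embedding_dim = 8  # kept tiny for readability

# One row per token, drawn from a small-variance normal distribution.
embedding_matrix = rng.normal(loc=0.0, scale=0.02, size=(vocab_size, embedding_dim))

# Embedding lookup is row indexing with the token IDs.
tokens = ["Th", "is", "an", "ex", "ample"]
token_ids = [token_to_id[t] for t in tokens]
token_vectors = embedding_matrix[token_ids]

print(token_vectors.shape)  # (5, 8)
```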

Training for Contextual Adjustment

During training, the model is exposed to large volumes of text data. It learns to adjust the embeddings based on the context in which each token appears. This learning process involves backpropagation and optimization algorithms (like Stochastic Gradient Descent or Adam) to iteratively adjust the embeddings to minimize the prediction error of the model.
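The sketch below shows the idea on the smallest setup I could think of: a single (current token, next token) training pair, a softmax over a five-token vocabulary, and plain gradient descent. The sizes, the learning rate, the token IDs, and the helper name loss_and_grads are all assumptions made for illustration; real LLMs compute these gradients through many Transformer layers with an automatic differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
vocab_size, embedding_dim = 5, 4
E = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))  # token embeddings
W = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))  # output projection
E_before = E.copy()  # kept only to show which rows change

def loss_and_grads(E, W, context_id, target_id):
    h = E[context_id]                      # embedding of the context token
    logits = h @ W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over the vocabulary
    loss = -np.log(probs[target_id])       # cross-entropy for the true next token
    dlogits = probs.copy()
    dlogits[target_id] -= 1.0              # gradient of the loss w.r.t. the logits
    grad_W = np.outer(h, dlogits)          # gradient w.r.t. the output projection
    grad_h = W @ dlogits                   # gradient flowing back into the embedding
    return loss, grad_W, grad_h

context_id, target_id = 1, 3   # hypothetical training pair (current token, next token)
learning_rate = 0.1
losses = []
for step in range(200):
    loss, grad_W, grad_h = loss_and_grads(E, W, context_id, target_id)
    losses.append(loss)
    W -= learning_rate * grad_W
    E[context_id] -= learning_rate * grad_h  # only the embedding that was used changes

print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
print(np.allclose(E[0], E_before[0]))  # True: unused embeddings are untouched
```

The prediction error shrinks as the embedding of the context token and the output weights are nudged along the negative gradient, which is the mechanism by which context shapes each token’s vector.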

Semantic Relationships

The model aims to adjust the embeddings so that tokens used in similar contexts have embeddings that are close to each other in the vector space. This closeness is typically measured using cosine similarity. This process creates a semantic map where words with similar meanings or usage are located near each other in the high-dimensional space.
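Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a small sketch; the three-dimensional vectors are hand-picked, hypothetical values rather than embeddings from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
cat = np.array([0.9, 0.8, 0.1])
feline = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print("cat vs feline:", round(cosine_similarity(cat, feline), 3))
print("cat vs car:", round(cosine_similarity(cat, car), 3))
```

With these made-up vectors, “cat” scores much higher against “feline” than against “car,” which is the pattern training is meant to produce for words used in similar contexts.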

Illustrative Python Code

I’ll walk through the process of initializing token embeddings and simulating a basic training process to adjust these embeddings using a simplified example.


import numpy as np

# Vocabulary and random initialization of embeddings
vocab = ["cat", "dog", "animal", "pet", "feline", "canine"]
embedding_dim = 300  # Typical dimensions range from 128 to 4096
embeddings = {word: np.random.randn(embedding_dim) for word in vocab}

# Function to simulate training adjustment
def adjust_embeddings(embeddings, word_pairs, learning_rate=0.01):
    for word1, word2 in word_pairs:
        if word1 in embeddings and word2 in embeddings:
            # Simple adjustment: move embeddings closer for related words
            embedding_diff = embeddings[word1] - embeddings[word2]
            embeddings[word1] -= learning_rate * embedding_diff
            embeddings[word2] += learning_rate * embedding_diff

# Simulated training data (word pairs that are contextually related)
related_word_pairs = [("cat", "feline"), ("dog", "canine"), ("cat", "pet"), ("dog", "pet")]

# Simulating training
for _ in range(1000):  # Number of iterations
    adjust_embeddings(embeddings, related_word_pairs)

# Check the adjusted embeddings
print("Adjusted Embeddings for 'cat' and 'dog':")
print("Cat:", embeddings["cat"][:5])  # Displaying first 5 dimensions
print("Dog:", embeddings["dog"][:5])

Explanation of the Code

  • Embedding initialization: The code defines a vocabulary and initializes embeddings for each word randomly, setting the embedding dimension to 300 for simplicity.
  • Training simulation: Using the adjust_embeddings function, the code simulates training by adjusting embeddings of related word pairs (e.g., “cat” and “feline”) to be closer in the vector space. This is achieved by iteratively reducing the distance between embeddings of each pair, mimicking how LLMs refine token embeddings based on context.
  • Outcome: After numerous iterations, the embeddings for contextually related words are adjusted to be nearer to each other in the embedding space, representing a basic form of contextual learning in LLMs.

Advancing to Contextual Interpretation: The Power of Transformer Architecture

Token embeddings give Large Language Models (LLMs) a learned numerical representation of each token, but on their own these vectors are the same no matter which words surround the token. Transformers build on this by processing these embeddings in parallel, using self-attention mechanisms to discern intricate word relationships. This synergy enables LLMs to interpret and generate text with a nuanced comprehension of language semantics and structure.

Transformer Architecture

Transformers are designed with a layered architecture, where each layer contributes to processing the input data. The core components of each layer are the self-attention mechanism and a feed-forward neural network. What sets Transformers apart from earlier models like RNNs and LSTMs is their ability to process all tokens in an input sequence simultaneously. This parallel processing approach not only boosts computational efficiency but also enables the model to capture complex dependencies and relationships within the data more effectively.

Self-Attention Mechanism

At the heart of each Transformer layer is the self-attention mechanism. This mechanism computes what are known as attention scores for each token in the input sequence. These scores are a measure of how much focus or “attention” the model should allocate to other tokens in the sequence when processing a particular token.

The self-attention mechanism allows the Transformer to dynamically adjust the influence of each token in the sequence based on the entire sequence itself. This is crucial for understanding context, as it enables the model to interpret each token not in isolation but in relation to others.

Moreover, Transformers employ what is termed multi-head attention. This means that the self-attention process is replicated multiple times in parallel within each layer. Each replication, or “head,” can potentially focus on different aspects of the token relationships, such as syntactic or semantic connections. By doing so, the model can capture a richer and more nuanced understanding of the text.
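A minimal NumPy sketch of multi-head attention follows. The projection matrices W_q, W_k, W_v, and W_o are random stand-ins for learned weights, and the sizes (4 tokens, 16 dimensions, 4 heads) are assumptions chosen to keep the output easy to inspect.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                    # each row sums to 1 within a head
    heads = weights @ V                                   # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(seed=0)
seq_len, d_model, num_heads = 4, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)   # (4, 16)
print(weights.shape)  # (4, 4, 4): one attention matrix per head
```

Each head works on its own slice of the projected vectors, so the heads can weight the same tokens differently before their outputs are concatenated and mixed by W_o.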

Positional Encodings

A unique challenge in the design of Transformers is their lack of inherent sequence processing capability – a feature that sequential models like RNNs inherently possess. To address this, Transformers utilize positional encodings. These are added to the token embeddings to provide the model with information about the position of each token within the sequence.

The positional encodings are often generated using sinusoidal functions. This method ensures that each position in the sequence receives a unique encoding, allowing the model to differentiate between tokens based on their order in the sequence. This positional information is crucial for the model to understand the flow and structure of language.
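The sinusoidal scheme can be sketched in a few lines: even dimensions use a sine and odd dimensions a cosine, with wavelengths that grow geometrically across the dimensions. The function name and the toy sizes are my own choices, and the sketch assumes an even embedding dimension.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return encoding

seq_len, d_model = 6, 8
token_embeddings = np.random.default_rng(seed=0).normal(size=(seq_len, d_model))
pe = sinusoidal_positional_encoding(seq_len, d_model)

# The encodings are added element-wise to the token embeddings.
model_input = token_embeddings + pe
print(model_input.shape)  # (6, 8)
```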

Layer-By-Layer Processing

In a Transformer model, each layer processes the input sequence, progressively transforming and refining the representation of each token. This transformation is informed by the context provided by the self-attention mechanism and the subsequent operations of the feed-forward network within the layer.

As the input data passes through successive layers of the Transformer, the representations of the tokens become increasingly refined and enriched with contextual information. This layer-by-layer processing allows for a sophisticated understanding and generation of language, enabling the model to handle complex language tasks with remarkable efficiency and effectiveness.

Code Example and Explanation

Below is a simplified Python code example to illustrate a basic version of the Transformer’s self-attention mechanism.


import math
import numpy as np

def softmax(x):
    # Normalize each row so that one token's attention weights sum to 1
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Compute dot product of Q and K, scaled by square root of dimension of K
    matmul_qk = np.dot(Q, K.T) / math.sqrt(K.shape[1])
    # Apply softmax to get attention weights
    attention_weights = softmax(matmul_qk)
    # Multiply by V to get output
    output = np.dot(attention_weights, V)
    return output, attention_weights

# Example token embeddings (randomly initialized)
token_embeddings = np.random.rand(3, 64)  # 3 tokens, 64-dimensional embeddings

# Simulating Query (Q), Key (K), and Value (V) matrices
Q = token_embeddings
K = token_embeddings
V = token_embeddings

# Applying self-attention
attention_output, attention_weights = scaled_dot_product_attention(Q, K, V)
print("Attention Output:", attention_output)
print("Attention Weights:", attention_weights)

Explanation of the Code

  • Softmax function: Used to normalize attention scores, ensuring that the weights each token assigns across the sequence sum to 1.
  • Scaled dot-product attention:

    • The core of the self-attention mechanism, this function computes the attention scores.
    • It scales the dot product of the query (Q) and key (K) matrices by the square root of the key’s dimension, which helps stabilize gradients during training.
    • The softmax function is applied to these scores, and the resulting weights are used to create a weighted sum of the values (V), representing the output of the attention mechanism.
  • Token Embeddings:

    • For illustration, we use randomly initialized embeddings. In practice, these would be learned through training.
    • We treat the same set of embeddings as queries, keys, and values for simplicity.
  • Output:

    • The attention_output represents the transformed embeddings after applying self-attention.
    • The attention_weights provide insight into how much each token attends to the others in the sequence.

This code offers a basic glimpse into how self-attention in Transformers operates. The real-world Transformer models are much more complex, with additional components like layer normalization, residual connections, and a feed-forward network in each layer, all of which are critical for their advanced performance.
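To give a feel for how those extra pieces fit together, here is a rough sketch of one layer that wraps single-head self-attention and a feed-forward network with residual connections and layer normalization. It assumes the post-norm arrangement (sublayer, then residual add, then normalization), omits the learnable scale and shift of layer normalization, and uses random weights. The parameter names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned gain/bias)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V

def feed_forward(X, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between, applied to each token independently
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

def transformer_layer(X, p):
    X = layer_norm(X + self_attention(X, p["W_q"], p["W_k"], p["W_v"]))     # residual + norm
    X = layer_norm(X + feed_forward(X, p["W1"], p["b1"], p["W2"], p["b2"]))  # residual + norm
    return X

def random_params(rng, d_model, d_ff):
    return {
        "W_q": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W_k": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W_v": rng.normal(scale=0.1, size=(d_model, d_model)),
        "W1": rng.normal(scale=0.1, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=0.1, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(seed=0)
seq_len, d_model, d_ff = 3, 16, 64
X = rng.normal(size=(seq_len, d_model))

# Stack two layers: the output of one layer is the input of the next.
layers = [random_params(rng, d_model, d_ff) for _ in range(2)]
hidden = X
for params in layers:
    hidden = transformer_layer(hidden, params)
print(hidden.shape)  # (3, 16)
```

The shape of the token representations stays the same from layer to layer, which is what allows many such layers to be stacked.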

Expanding Horizons: From Text-Centric LLMs to the Versatile World of Large Multimodal Models (LMMs)

Bridging from the text-focused realm of LLMs to the more diverse world of Large Multimodal Models (LMMs), we see an expansion in capabilities. While LLMs excel in interpreting and generating text, LMMs extend this proficiency to include other types of data, notably images. This transition marks a significant evolution in model design and functionality. LMMs integrate the text-processing power of Transformers with advanced neural network architectures capable of handling visual data, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The synergy of these technologies in LMMs enables them to process and understand a combination of data types, like text and images.

Image Processing in LMMs

CNNs and ViTs in Image Processing

  • Convolutional Neural Networks (CNNs): CNNs are structured with layers of convolutional filters that systematically apply convolution operations on the input image. Each convolution layer extracts features from its input and passes these features to subsequent layers. Early layers tend to capture basic visual features like edges and textures, while deeper layers can identify more complex patterns and objects.

    • Example: In a retail chatbot, a CNN might analyze a product image to identify features like color and design, which can then be matched with inventory data.
  • Vision Transformers (ViTs): A newer approach, ViTs apply the Transformer architecture to image processing. A ViT divides an image into patches, linearly embeds each patch, and adds a position encoding. The sequence of patch embeddings is then processed through a series of Transformer layers. ViTs leverage self-attention mechanisms to focus on different parts of the image, capturing both local and global features. For example, in an art analysis application, a ViT could dissect an artwork into patches to understand both the fine details and the overall composition.
  • Feature Vector Transformation: After processing through CNNs or ViTs, the resultant feature vectors represent crucial visual information. These vectors are then transformed to align with the high-dimensional space used for token embeddings in text. This alignment is essential for creating a common ground where both textual and visual information can be compared and related. For example, in a medical diagnosis tool, feature vectors from MRI scans could be combined with clinical notes to enhance diagnostic accuracy.
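The patch-and-embed step used by ViTs can be sketched as follows. The image is random noise, and the patch size of 8, the model dimension of 64, and the random projection and position embeddings are all assumptions standing in for learned parameters. The sketch also assumes the image height and width are divisible by the patch size.

```python
import numpy as np

def image_to_patches(image, patch_size):
    height, width, channels = image.shape
    rows, cols = height // patch_size, width // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)  # group pixels by (patch row, patch col)
    # One flattened vector per patch
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

rng = np.random.default_rng(seed=0)
image = rng.random((32, 32, 3))  # stand-in for a 32x32 RGB image
patch_size, d_model = 8, 64

patches = image_to_patches(image, patch_size)                      # (16, 192)
W_patch = rng.normal(scale=0.02, size=(patches.shape[1], d_model)) # linear patch embedding
position_embeddings = rng.normal(scale=0.02, size=(patches.shape[0], d_model))

# Project each patch into the model dimension and add position information.
patch_embeddings = patches @ W_patch + position_embeddings
print(patches.shape)           # (16, 192)
print(patch_embeddings.shape)  # (16, 64)
```

After this step, the image is a sequence of 16 vectors with the same width as token embeddings, which is what lets Transformer layers treat patches much like tokens.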

Unified Representation in LMMs

Combining Modalities

In LMMs, the goal is to create a space where feature vectors from images and token embeddings from text can coexist and be processed together. This coexistence allows the model to perform operations that consider both textual and visual information, enabling a more comprehensive understanding of multimodal content. Unified representations in LMMs enable a range of applications. For example, they can generate descriptive captions for images or answer questions that require insights from both an image and accompanying text. This capability is particularly useful in scenarios like content moderation, where understanding the context of both text and images is crucial.

Cross-Modal Attention in LMMs

Attention Across Modalities

Cross-modal attention mechanisms in LMMs allow the model to focus on specific parts of one modality based on the context provided by another. For instance, when processing a text-image pair, the model might focus on specific regions of the image that are directly relevant to the textual content.

Enhancing Understanding

This cross-modal attention is pivotal for tasks that require an integrated understanding of both textual and visual information. It allows LMMs to perform sophisticated analysis, like interpreting the sentiment of an image based on its visual elements and accompanying text or providing detailed explanations of visual content.
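One common way to realize this is cross-attention, in which queries come from one modality and keys and values from the other. The sketch below lets five text tokens attend over sixteen image patches. All sizes and weight matrices are random, hypothetical stand-ins, and actual LMMs differ in how and where they apply this.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(text_states, image_states, W_q, W_k, W_v):
    Q = text_states @ W_q    # queries come from the text tokens
    K = image_states @ W_k   # keys come from the image patches
    V = image_states @ W_v   # values come from the image patches
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # (num_text_tokens, num_image_patches)
    return weights @ V, weights

rng = np.random.default_rng(seed=0)
d_model = 32
text_states = rng.normal(size=(5, d_model))    # e.g. 5 text tokens
image_states = rng.normal(size=(16, d_model))  # e.g. 16 image patches
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))

attended, weights = cross_attention(text_states, image_states, W_q, W_k, W_v)
print(attended.shape)  # (5, 32): one image-informed vector per text token
print(weights.shape)   # (5, 16): how strongly each token attends to each patch
```

Each row of the weights matrix shows which image patches a given text token draws on, which is the sense in which one modality directs attention over the other.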

Illustrative Example

I’ll walk through a Python code example to illustrate how a model might process an image and text together. This example is highly simplified and does not represent the complexity of actual LMMs.


import numpy as np

# Hypothetical functions to process text and images
def process_text(text):
    # This function would convert text to embeddings
    return np.array([0.5, 1.2, 0.7])  # Dummy embeddings

def process_image(image):
    # This function would use a CNN or ViT to extract features
    return np.array([1.0, 0.8, 0.3])  # Dummy image features

# Sample text and image
text = "A cat on a mat"
image = "cat_image.jpg"  # In practice, this would be the image data

# Process the text and image
text_embeddings = process_text(text)
image_features = process_image(image)

# Combining the embeddings and features
combined_representation = np.concatenate((text_embeddings, image_features))
print("Combined Text and Image Representation:", combined_representation)


  • Process text and image: The process_text and process_image functions are placeholders for more complex operations in real LMMs. They convert text and images into embeddings and feature vectors, respectively.
  • Combining representations: The embeddings and feature vectors are concatenated to create a unified representation, allowing the model to process and relate information from both text and image.
  • Application: In a real LMM, this combined representation would then be processed further, possibly through a Transformer-like architecture, to perform tasks like image captioning or visual question answering.

The remarkable pace of development in Large Language Models (LLMs) and Large Multimodal Models (LMMs) paves the way for a future brimming with unprecedented possibilities. As we stand at the cusp of this transformative era, the potential of LLMs and LMMs to revolutionize various sectors — from education to industry — is both immense and inspiring.
