Understanding the Transformer Architecture from Scratch

The world of Artificial Intelligence has been revolutionized by a single architectural innovation: the Transformer. If you’ve interacted with large language models (LLMs) like ChatGPT, Bard, or even used advanced translation services, you’ve witnessed the power of the Transformer in action. Before its advent, recurrent neural networks (RNNs) and their variants like LSTMs and GRUs were the state-of-the-art for sequence processing. While effective, they struggled with parallelization and capturing very long-range dependencies efficiently.

In 2017, a team of researchers at Google published a seminal paper titled “Attention Is All You Need” [1]. This paper introduced the Transformer, an architecture that completely eschewed recurrence and convolutions, relying solely on an attention mechanism. This radical shift unlocked unprecedented capabilities, particularly in natural language processing (NLP), setting the stage for the generative AI explosion we’re experiencing today.

This post aims to demystify the Transformer architecture, breaking it down into its fundamental building blocks. We’ll start from the ground up, explaining each component and why it’s necessary, allowing you to grasp the elegant simplicity behind its immense power.

The Encoder-Decoder Framework: A High-Level View

At its core, the original Transformer is an encoder-decoder model. This structure is common in sequence-to-sequence (seq2seq) tasks, such as machine translation (e.g., English to French).

  • Encoder: Processes the input sequence (e.g., an English sentence) and transforms it into a rich, contextualized representation. Think of it as generating a sophisticated “summary” or “understanding” of the input.
  • Decoder: Takes the encoder’s output representation and generates the output sequence (e.g., the French translation) one token at a time, based on its own previous outputs and the encoder’s context.

Both the encoder and decoder are composed of stacked identical layers, allowing the model to build up increasingly abstract representations.

The Encoder Block: Understanding the Input

Let’s dissect a single encoder layer. Each encoder layer receives a list of input embeddings, which represent the tokens (words or sub-words) of the input sequence.

1. Input Embedding and Positional Encoding

Neural networks operate on numerical data. So, the very first step is to convert discrete input tokens (like words) into continuous numerical vectors. This is done via an embedding layer. Each unique word in our vocabulary is mapped to a dense vector of a fixed dimension (e.g., 512).
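
To make this concrete, here is a minimal PyTorch sketch of an embedding lookup; the vocabulary size and token IDs below are made up for the example:

```python
import torch
import torch.nn as nn

# Illustrative sizes: a vocabulary of 10,000 sub-word tokens, each mapped to a 512-dimensional vector.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=512)

token_ids = torch.tensor([[5, 102, 7]])   # (batch=1, seq_len=3) integer token IDs
token_vectors = embedding(token_ids)      # shape (1, 3, 512): one dense vector per token
```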

However, unlike RNNs which process words sequentially, the Transformer’s attention mechanism processes all words in a sequence simultaneously. This means it inherently loses information about the order of words. For example, “dog bites man” and “man bites dog” would have the same set of word embeddings but completely different meanings.

To reintroduce this crucial sequential information, the Transformer adds Positional Encoding to the input embeddings. These are vectors that carry information about the position of each token in the sequence.

The original paper uses fixed sinusoidal functions to generate these positional encodings:

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$

Where:

  • $pos$ is the position of the token.
  • $i$ is the dimension within the positional encoding vector.
  • $d_{model}$ is the dimensionality of the embedding.

By adding these unique positional vectors to the word embeddings, the model gains awareness of token order without relying on recurrence. This allows it to understand relationships between words based on their relative and absolute positions.
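
For illustration, here is a minimal PyTorch sketch of the sinusoidal encoding above (the function name is ours, and $d_{model}$ is assumed to be even):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)               # (d_model/2,)
    angles = position / torch.pow(10000.0, dim / d_model)                # (seq_len, d_model/2)

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions get the sine term
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions get the cosine term
    return pe

# The encoding is simply added to the token embeddings before the first encoder layer:
# x = token_vectors + sinusoidal_positional_encoding(seq_len, d_model)
```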

2. Multi-Head Self-Attention (MHSA)

This is the heart of the Transformer. It’s how the model decides which parts of the input sequence are most relevant for processing a given token.

What is Self-Attention?

Self-attention allows each token in the input sequence to “look” at all other tokens in the same sequence to compute a new representation that incorporates context. For example, when processing the word “bank” in “river bank,” self-attention would allow the model to recognize that “river” is highly relevant, helping it disambiguate “bank” from “financial bank.”

The core mechanism involves three learned weight matrices:

  • Query (Q): Represents the current token (what I’m looking for).
  • Key (K): Represents all other tokens (what I have available).
  • Value (V): Represents the content of all other tokens (the actual information to extract).

Think of it like querying a database: you have a Query (what you want to find), you match it against Keys (indices in the database), and if there’s a match, you retrieve the corresponding Values (the data itself).

The calculation for Scaled Dot-Product Attention is:

$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$

Where:

  • $Q$, $K$, $V$ are matrices derived by multiplying the input embeddings by the respective weight matrices $W_Q, W_K, W_V$.
  • $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large, which would push the softmax into regions with extremely small gradients and hinder learning.

The result of $QK^T$ is an “attention score” matrix, where each element $(i, j)$ indicates how much token $i$ should attend to token $j$. The softmax normalizes these scores into probability distributions. Finally, multiplying by $V$ means we’re taking a weighted sum of the Value vectors, where the weights are the attention probabilities.
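
In code, the whole computation is only a few lines. The sketch below is a straightforward PyTorch rendering of the formula (the optional mask argument anticipates the decoder, discussed later):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    q, k, v have shape (..., seq_len, d_k); mask (optional) broadcasts over the
    score matrix, with 0/False marking positions that must not be attended to.
    """
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))        # forbidden positions get -inf
    weights = F.softmax(scores, dim=-1)                              # attention probabilities
    return torch.matmul(weights, v), weights
```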

Why “Multi-Head”?

Instead of performing one self-attention computation, the Transformer performs several of them in parallel. Each parallel attention layer is called a “head.”

  • Diverse Perspectives: Each head learns a different set of Q, K, V weight matrices. This allows the model to attend to different parts of the input sequence, or to focus on different aspects of the relationships between tokens. For example, one head might focus on grammatical relationships, while another focuses on semantic similarities.
  • Rich Representations: After computing attention for each head, their outputs are concatenated and then linearly projected back into the original $d_{model}$ dimension. This gives the model a richer, more diverse understanding of the input sequence.
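
A compact multi-head self-attention module might look like the sketch below; it reuses the scaled_dot_product_attention function from the previous sketch, and the class name and default sizes are illustrative:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention: project, split into heads, attend, concatenate, project."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One learned projection each for Q, K, V, plus the output projection W_O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, _ = x.shape

        def split_heads(t):
            # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(self.w_q(x)), split_heads(self.w_k(x)), split_heads(self.w_v(x))
        out, _ = scaled_dot_product_attention(q, k, v, mask)   # sketched earlier
        # Concatenate the heads back together and project to d_model.
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(out)
```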

3. Add & Norm (Residual Connections and Layer Normalization)

After the Multi-Head Self-Attention sub-layer, the output is passed through two critical sub-components:

Residual Connections (Add)

Inspired by ResNet [2], a residual connection simply adds the input of a sub-layer to its output. If $X$ is the input to a sub-layer and $SubLayer(X)$ is its output, the output of the residual connection is $X + SubLayer(X)$. This helps in:

  • Training Deep Networks: It mitigates the vanishing gradient problem, allowing gradients to flow more easily through many layers.
  • Preserving Information: It ensures that information from earlier layers is not lost as it propagates through the network.

Layer Normalization (Norm)

After the residual connection, the result is passed through a layer normalization step [3]. Unlike Batch Normalization which normalizes features across the batch dimension, Layer Normalization normalizes the activations across the feature dimension for each individual sample. This helps in:

  • Stabilizing Training: It makes the training process more stable and faster by ensuring that the inputs to subsequent layers have a consistent distribution.
  • Independence from Batch Size: It’s particularly useful in NLP where sequence lengths can vary, and batch sizes might be small.

So, for any sub-layer, the operation is effectively $\text{Norm}(X + \text{SubLayer}(X))$.
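
As a minimal sketch of this post-norm arrangement (the wrapper class and its name are our own):

```python
import torch.nn as nn

class AddAndNorm(nn.Module):
    """Residual connection followed by layer normalization: Norm(x + sublayer(x))."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # `sublayer` is a callable such as the attention or feed-forward module.
        return self.norm(x + sublayer(x))
```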

4. Feed-Forward Network (FFN)

The output of the “Add & Norm” from the attention layer is then passed through a simple position-wise fully connected feed-forward network. This is essentially a two-layer multi-layer perceptron (MLP) with a ReLU activation in between:

$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$

This FFN is applied independently and identically to each position in the sequence. It’s a way for the model to process the information gathered by the attention mechanism and transform it into a more useful representation for the next layer. It adds non-linearity and allows the model to learn complex patterns.
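
A position-wise FFN is just two linear layers with a ReLU in between; the sketch below uses the inner dimension of 2048 from the original paper, though the class name is ours:

```python
import torch.nn as nn

class PositionWiseFFN(nn.Module):
    """Two linear layers with ReLU, applied identically at every position."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # xW_1 + b_1
            nn.ReLU(),                  # max(0, ·)
            nn.Linear(d_ff, d_model),   # (·)W_2 + b_2
        )

    def forward(self, x):
        # x has shape (batch, seq_len, d_model); the same weights are applied at each position.
        return self.net(x)
```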

Each encoder layer thus applies “Multi-Head Self-Attention” followed by “Add & Norm,” then the “Feed-Forward Network” followed by another “Add & Norm.” The original Transformer stacked 6 such encoder layers on top of each other.
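
Putting the pieces together, a single encoder layer can be sketched by composing the modules above (again, names and defaults are illustrative, not a reference implementation):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and FFN, each wrapped in Add & Norm."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)   # sketched earlier
        self.ffn = PositionWiseFFN(d_model, d_ff)                # sketched earlier
        self.addnorm1 = AddAndNorm(d_model)
        self.addnorm2 = AddAndNorm(d_model)

    def forward(self, x, mask=None):
        x = self.addnorm1(x, lambda t: self.attn(t, mask))   # Multi-Head Self-Attention + Add & Norm
        x = self.addnorm2(x, self.ffn)                       # Feed-Forward Network + Add & Norm
        return x
```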

The Decoder Block: Generating the Output

The decoder is similar to the encoder but with a crucial difference: it must generate the output sequence one token at a time, ensuring that the prediction for the current token only depends on the previously generated tokens and the input sequence.

Each decoder layer also takes two inputs:

  1. The output from the previous decoder layer (or the target sequence embeddings for the first layer).
  2. The output from the top encoder layer.

Let’s break down a decoder layer:

1. Input Embedding and Positional Encoding (for Target Sequence)

Similar to the encoder, the target sequence tokens (e.g., previously generated words in the French translation) are first converted into embeddings, and positional encodings are added to preserve order.

2. Masked Multi-Head Self-Attention

This is the first attention sub-layer in the decoder. It’s almost identical to the encoder’s self-attention, but with one critical modification: masking.

During training, to prevent the decoder from “cheating” by looking at future tokens in the target sequence, a look-ahead mask is applied to the attention scores. This mask effectively sets the attention scores for future positions to negative infinity, so their softmax probabilities become zero. This ensures that the prediction for a given position depends only on the known outputs at previous positions.

The calculation is still $\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$, but $QK^T$ is masked before the softmax.
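
The mask itself is just a lower-triangular matrix over the target positions; a small sketch:

```python
import torch

def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# For a 4-token target sequence; False entries are set to -inf before the softmax:
# look_ahead_mask(4)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```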

After this masked attention, there’s another “Add & Norm” step.

3. Encoder-Decoder Attention (Cross-Attention)

This is the second attention sub-layer in the decoder, and it’s where the decoder finally interacts with the encoder’s output. It’s often referred to as “cross-attention.”

Here’s how it works:

  • Queries (Q) come from the output of the previous masked self-attention layer in the decoder.
  • Keys (K) and Values (V) come from the output of the top encoder layer.

This mechanism allows every position in the decoder’s output sequence to attend to all positions in the input sequence. This is crucial for tasks like machine translation, where the generated output needs to be highly aligned with the meaning of the input. For instance, when translating “bank” from English to French, this layer helps the decoder decide whether to attend to “river” or “money” from the input English sentence to pick the correct French word.
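
The only structural change from self-attention is where Q, K, and V come from. Here is a single-head sketch (the real model uses multiple heads and an output projection; it reuses the scaled_dot_product_attention function from earlier):

```python
import torch.nn as nn

class CrossAttention(nn.Module):
    """Encoder-decoder attention: queries from the decoder, keys/values from the encoder."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, decoder_state, encoder_output):
        q = self.w_q(decoder_state)    # from the decoder's masked self-attention output
        k = self.w_k(encoder_output)   # from the top encoder layer
        v = self.w_v(encoder_output)   # also from the top encoder layer
        out, _ = scaled_dot_product_attention(q, k, v)   # sketched earlier
        return out
```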

Again, this is followed by an “Add & Norm” step.

4. Feed-Forward Network (FFN)

Just like in the encoder, the output of the encoder-decoder attention (after Add & Norm) is passed through a position-wise Feed-Forward Network. This allows the decoder to process and transform the combined information from both the input and the previously generated target sequence.

This is followed by the final “Add & Norm” step for the decoder layer. The original Transformer also used 6 such decoder layers.

5. Linear and Softmax Layer (Output)

The output from the final decoder layer (a continuous vector for each token position) is then passed through a Linear layer. This layer projects each $d_{model}$-dimensional vector to a vector of raw scores (logits) whose length equals the size of the model’s vocabulary.

Finally, a Softmax layer converts these raw scores into probability distributions over the entire vocabulary. The token with the highest probability is chosen as the next word in the output sequence. During inference (when generating text), this chosen token then becomes part of the input to the next step of the decoder.
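
A minimal sketch of this final projection, with made-up sizes (real vocabularies are typically tens of thousands of sub-word tokens):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 32_000             # illustrative sizes
to_logits = nn.Linear(d_model, vocab_size)    # the final Linear layer

decoder_output = torch.randn(1, 10, d_model)  # (batch, target_len, d_model)
logits = to_logits(decoder_output)            # (batch, target_len, vocab_size)
probs = F.softmax(logits, dim=-1)             # distribution over the vocabulary
next_token = probs[:, -1, :].argmax(dim=-1)   # greedy choice of the next token
```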

Training the Transformer

Training a Transformer typically involves:

  • Loss Function: Usually cross-entropy loss, comparing the predicted token distribution with the actual next token.
  • Optimizer: Often Adam with a custom learning rate schedule that includes a “warm-up” phase (gradually increasing learning rate) followed by a decay phase; this helps keep training stable (a small sketch of this schedule follows the list).
  • Teacher Forcing: During training, the decoder is fed the actual previous token from the target sequence (rather than its own prediction) at each step. This significantly speeds up training and makes it more stable.
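
The warm-up schedule from [1] can be written as a one-line function; this sketch takes the formula directly from the paper, with its default warm-up of 4,000 steps:

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given step: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```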

Key Advantages of the Transformer

  1. Parallelization: The most significant advantage. Since there are no recurrent connections, all tokens in a sequence can be processed in parallel in the encoder (and, during training, in the decoder as well thanks to masking). This leads to much faster training than RNNs, which must process tokens one step at a time.
  2. Long-Range Dependencies: The attention mechanism can directly attend to any part of the sequence, regardless of distance. This makes it highly effective at capturing long-range dependencies that often troubled RNNs due to vanishing/exploding gradients.
  3. Contextual Embeddings: The self-attention mechanism allows each token’s representation to be a dynamic, context-aware embedding, rather than a fixed one. The meaning of “bank” changes based on “river” or “money” in the input, and the Transformer explicitly learns this through attention.
  4. Transfer Learning: Transformers are incredibly effective for pre-training on massive datasets (e.g., large text corpora) and then fine-tuning for specific downstream tasks. This paradigm shift, exemplified by models like BERT, GPT, and T5, is a major reason for their widespread success.

Limitations and Considerations

While revolutionary, Transformers are not without their considerations:

  1. Quadratic Complexity: The standard self-attention mechanism has a computational and memory complexity that scales quadratically with the sequence length ($O(N^2)$). This makes processing very long sequences computationally expensive, leading to research in efficient attention mechanisms (e.g., Reformer, Linformer, Performer, Longformer).
  2. Memory Footprint: The number of parameters in Transformer models can be enormous, leading to significant memory requirements, especially for very large models (billions of parameters).
  3. Inductive Bias: Unlike CNNs (which have a natural inductive bias for local features and translation equivariance) or RNNs (which have a bias for sequence order), vanilla Transformers lack strong inherent inductive biases. They rely more heavily on massive amounts of data to learn these relationships.

The Transformer’s Enduring Impact

The Transformer architecture, initially conceived for machine translation, has proven to be incredibly versatile. It’s the foundational block for:

  • Large Language Models (LLMs): GPT-3, GPT-4, BERT, T5, LLaMA, PaLM, and countless others are all built upon Transformer principles, often using only the decoder (for generative models like GPT) or only the encoder (for understanding models like BERT).
  • Vision Transformers (ViT): Demonstrating its applicability beyond NLP, Transformers are now state-of-the-art in computer vision tasks, processing image patches as sequences.
  • Speech Recognition, Drug Discovery, and more: Its ability to model complex dependencies in sequence data makes it a powerful tool across diverse domains.

The journey from “Attention Is All You Need” to models that can write poetry, code, and answer complex questions has been remarkably swift, largely thanks to the elegant and scalable design of the Transformer. Understanding its core components provides a solid foundation for comprehending the current landscape of AI and the exciting developments yet to come.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762

[2] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778. https://arxiv.org/abs/1512.03385

[3] Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv preprint arXiv:1607.06450. https://arxiv.org/abs/1607.06450
