Transformer Architecture Basics: From Attention to Modern AI (Lecture 15)

In this lecture, we’ll introduce the Transformer architecture, which has become the foundation of modern AI models like GPT and BERT.
Unlike RNNs or LSTMs, which process sequences step by step, Transformers rely entirely on attention mechanisms and process all tokens in parallel, making them both faster to train and better at capturing long-range dependencies.


Table of Contents

{% toc %}


1) Why Transformers?

Traditional sequence models like RNNs and LSTMs process data sequentially, making training slow and prone to long-term dependency issues.

Transformers solve this by:

  • Using self-attention to capture relationships between words
  • Allowing parallel computation across the entire sequence
  • Scaling well to large datasets and modern hardware

2) Core Components of the Transformer

  1. Self-Attention
    Each word attends to every other word in the sentence to capture context (a minimal sketch follows this list).
    Example: “The cat sat on the mat because it was tired.” → “it” refers to “cat.”

  2. Multi-Head Attention
    Multiple attention heads focus on different types of relationships simultaneously (syntax, semantics, etc.).

  3. Positional Encoding
    Since Transformers don’t inherently understand word order, positional encodings (sinusoidal functions) are added to the embeddings to preserve sequence information (also sketched below).

  4. Feed-Forward Network (FFN)
    A two-layer fully connected network applied independently to each position to enrich its representation.

  5. Residual Connections + Layer Normalization
    Help stabilize and speed up training.
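
To make self-attention and positional encoding concrete, here is a minimal sketch in TensorFlow (single attention head only; the dense projections wq, wk, wv and all sizes are illustrative choices, not a fixed API). The attention half implements softmax(Q·Kᵀ/√d)·V, and the positional-encoding half follows the sinusoidal formulas from the original paper.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy input (batch=1, tokens=5, dimension=16)
x = tf.random.normal((1, 5, 16))
d_model = 16

# --- Scaled dot-product self-attention (single head) ---
# Illustrative dense projections for queries, keys, and values
wq, wk, wv = layers.Dense(d_model), layers.Dense(d_model), layers.Dense(d_model)
q, k, v = wq(x), wk(x), wv(x)

scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(tf.cast(d_model, tf.float32))
weights = tf.nn.softmax(scores, axis=-1)          # each token's distribution over all tokens
attended = tf.matmul(weights, v)                  # context-aware token representations
print("Attention output shape:", attended.shape)  # (1, 5, 16)

# --- Sinusoidal positional encoding ---
# PE[pos, 2i] = sin(pos / 10000^(2i/d)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
positions = np.arange(5)[:, None]        # (5, 1)
dims = np.arange(d_model)[None, :]       # (1, 16)
angles = positions / np.power(10000, (2 * (dims // 2)) / d_model)
pe = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))
x_with_position = x + tf.cast(pe[None, ...], tf.float32)      # add order information to the embeddings
print("Input with positions shape:", x_with_position.shape)  # (1, 5, 16)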


3) Encoder-Decoder Structure

  • Encoder: Reads the input sequence and generates a context representation.
  • Decoder: Uses the encoder’s output and attention to generate the target sequence.

For tasks like classification or sentence embeddings, often only the encoder is used (as in BERT).
For machine translation, both the encoder and the decoder are needed; the decoder attends to the encoder’s output through cross-attention, as the sketch below illustrates.
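
As a rough illustration of how the decoder consults the encoder, the sketch below uses Keras MultiHeadAttention for cross-attention: the queries come from hypothetical decoder states, while the keys and values come from the encoder output. A real decoder also applies masked self-attention over its own tokens before this step, which is omitted here.

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical encoder output (5 source tokens) and decoder states (3 target tokens so far)
encoder_output = tf.random.normal((1, 5, 16))
decoder_states = tf.random.normal((1, 3, 16))

# Cross-attention: decoder queries attend over the encoder's keys and values
cross_attn = layers.MultiHeadAttention(num_heads=2, key_dim=16)
context = cross_attn(query=decoder_states, value=encoder_output, key=encoder_output)

print("Cross-attention output shape:", context.shape)  # (1, 3, 16): one context vector per target token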


4) Intuitive Analogy

Think of a meeting discussion:

  • Self-attention = deciding which participant’s statement is most relevant
  • Multi-head attention = multiple note-takers, each focusing on different perspectives
  • Positional encoding = marking who spoke first, second, and so on

5) Hands-On: Simple Transformer Encoder in TensorFlow

import tensorflow as tf
from tensorflow.keras import layers

# Input (batch=1, tokens=5, dimension=16)
x = tf.random.normal((1, 5, 16))

# Multi-Head Attention
attn = layers.MultiHeadAttention(num_heads=2, key_dim=16)
out1 = attn(query=x, value=x)  # self-attention: queries, keys, and values all come from x

# Residual + LayerNorm
out1 = layers.LayerNormalization()(x + out1)

# Feed-Forward Network
ffn = tf.keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(16)
])
out2 = ffn(out1)

# Residual + LayerNorm
encoder_output = layers.LayerNormalization()(out1 + out2)

print("Encoder output shape:", encoder_output.shape)

Expected Output:

Encoder output shape: (1, 5, 16)
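
As a follow-up, the same steps can be bundled into a reusable layer. TransformerEncoderBlock below is a name chosen for this sketch (not a built-in Keras class); it simply packages the multi-head attention, the feed-forward network, and the two residual + normalization steps from the snippet above so that blocks can be stacked.

import tensorflow as tf
from tensorflow.keras import layers

class TransformerEncoderBlock(layers.Layer):
    """One encoder block: self-attention → add & norm → FFN → add & norm."""
    def __init__(self, d_model=16, num_heads=2, ff_dim=64, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(d_model),
        ])
        self.norm1 = layers.LayerNormalization()
        self.norm2 = layers.LayerNormalization()

    def call(self, x):
        attn_out = self.attn(query=x, value=x)   # self-attention
        x = self.norm1(x + attn_out)             # residual + layer norm
        ffn_out = self.ffn(x)                    # position-wise feed-forward
        return self.norm2(x + ffn_out)           # residual + layer norm

# Usage: stack these blocks like any other Keras layer
block = TransformerEncoderBlock()
print(block(tf.random.normal((1, 5, 16))).shape)  # (1, 5, 16)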

6) Advantages of Transformers

  • Parallel training → much faster than RNNs
  • Handles long dependencies via self-attention
  • Versatility → applied to NLP, vision, speech, and more

7) Key Takeaways

  • Transformers are built entirely on attention mechanisms, with no recurrence.
  • The original architecture consists of stacked encoder and decoder blocks, though many modern models keep only one side (encoder-only BERT, decoder-only GPT).
  • They are the backbone of state-of-the-art models like GPT and BERT.

8) What’s Next?

In Lecture 16, we’ll explore the BERT architecture and its pretraining techniques; BERT is one of the most widely used Transformer-based models in NLP.