Attention Mechanism Basics: Understanding Query, Key, and Value (Lecture 14)
In this lecture, we’ll explore the Attention Mechanism, one of the most impactful innovations in deep learning and Natural Language Processing (NLP).
The key idea is simple: instead of treating all words equally, the model focuses on the most relevant words to improve context understanding.
Table of Contents
{% toc %}
1) Why Attention Matters
Traditional sequence models such as RNNs, LSTMs, and GRUs struggle with long sentences: because the entire history must be squeezed into a fixed-size hidden state, information from early time steps is often forgotten.
Example:
“I watched a movie with a friend yesterday, had dinner, and read a book. The movie was really fun.”
To correctly interpret “The movie,” the model must recall the mention of “movie” at the start of the sentence, and RNNs often fail to carry that information across so many intervening words.
Attention solves this by letting the model assign higher weights to the important words, no matter how far away they are.
2) Core Idea of Attention
Attention relies on three key components:
- Query (Q): What we’re looking for
- Key (K): What each word represents
- Value (V): The actual information carried
Formula

Attention(Q, K, V) = softmax(QKᵀ)V

- QKᵀ → similarity between words
- softmax → converts similarity into probability weights
- Final output is a weighted sum of Values
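To make the formula concrete, here is a tiny worked sketch in TensorFlow. The 2×2 matrices are toy numbers invented for illustration, not part of the lecture’s code:

```python
import tensorflow as tf

# Two words, two-dimensional Q/K/V — toy values chosen for illustration.
Q = tf.constant([[1.0, 0.0], [0.0, 1.0]])
K = tf.constant([[1.0, 0.0], [0.0, 1.0]])
V = tf.constant([[10.0, 0.0], [0.0, 10.0]])

scores = Q @ tf.transpose(K)              # QKᵀ: similarity between words
weights = tf.nn.softmax(scores, axis=-1)  # rows become probability weights
output = weights @ V                      # weighted sum of Values

print(weights.numpy())  # [[0.73, 0.27], [0.27, 0.73]]
```

Because the first query matches the first key most strongly, the first row of weights leans toward the first Value, so the output for word 1 is mostly a copy of Value 1.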
3) Intuitive Example
Sentence: “The cat sat on the mat because it was tired.”
Here, the word “it” refers to “cat”.
Attention assigns higher weight to “cat” when interpreting “it.”
4) Hands-On: Implementing Simple Attention (TensorFlow)
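Below is a minimal self-contained sketch of dot-product self-attention in TensorFlow. The sentence, the embedding size (`embed_dim`), and the random projection matrices `W_q`, `W_k`, `W_v` are illustrative assumptions; in a trained model these projections are learned parameters:

```python
import tensorflow as tf

# Toy sentence and embedding size — illustrative choices, not learned values.
words = ["I", "watched", "a", "movie", "yesterday"]
embed_dim = 8

# Random embeddings stand in for learned token embeddings.
embeddings = tf.random.normal((len(words), embed_dim))

# In a real model, Q, K, and V come from learned linear projections.
W_q = tf.random.normal((embed_dim, embed_dim))
W_k = tf.random.normal((embed_dim, embed_dim))
W_v = tf.random.normal((embed_dim, embed_dim))

Q = embeddings @ W_q  # Query: what each word is looking for
K = embeddings @ W_k  # Key: what each word represents
V = embeddings @ W_v  # Value: the information each word carries

# softmax(QKᵀ)V — the formula from Section 2.
scores = Q @ tf.transpose(K)              # QKᵀ: pairwise similarities
weights = tf.nn.softmax(scores, axis=-1)  # each row sums to 1
output = weights @ V                      # weighted sum of Values

print("Attention weights:")
print(weights.numpy().round(2))
```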
Sample Output: the script prints a 5×5 matrix of attention weights in which every row sums to 1. Row i shows how word i distributes its focus across all the words in the sentence; the exact values vary from run to run because the embeddings are random, but the row structure is what matters.
5) Advantages of Attention
- Handles long dependencies – connects distant words easily
- Parallelizable – faster training than RNNs
- Interpretability – we can visualize attention weights to understand focus (see the sketch below)
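As a quick sketch of that last point, the weight matrix from the hands-on example can be rendered as a heatmap. This assumes matplotlib 3.5+ is installed and reuses the `weights` and `words` variables from Section 4:

```python
import matplotlib.pyplot as plt

# Reuses `weights` and `words` from the TensorFlow example in Section 4.
fig, ax = plt.subplots()
im = ax.imshow(weights.numpy(), cmap="viridis")
ax.set_xticks(range(len(words)), labels=words)
ax.set_yticks(range(len(words)), labels=words)
ax.set_xlabel("attended word (Key)")
ax.set_ylabel("attending word (Query)")
fig.colorbar(im, ax=ax, label="attention weight")
plt.show()
```

Bright cells in row i mark the words that word i attends to most strongly.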
6) Key Takeaways
- Attention introduces the concept of “focus” in neural networks
- Query, Key, and Value drive context-aware understanding
- The TensorFlow example demonstrated how attention weights are distributed across words
7) What’s Next?
In Lecture 15, we’ll dive into the Transformer architecture, which builds entirely on attention and powers modern models like GPT and BERT.