Attention Mechanism in the Transformer

Input Sentence:

"I am Fine."

We will walk through scaled dot-product attention as used in the Transformer architecture (Vaswani et al., 2017).
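
The full computation we will trace by hand is:

Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V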

Step 0: Preliminaries and Assumptions

Element                          Value
Sentence                         "I am Fine." (3 tokens)
Vocabulary size                  10,000
Embedding size (d_model)         4
Number of tokens (seq_len)       3
Attention head dimension (d_k)   4

Step 1: Tokenization and Input Embeddings

Tokens: [I, am, Fine.] (for simplicity, "Fine." is treated as a single token)

Embedding matrix E shape: 10,000 × 4 (vocabulary size × d_model)

Assumed embeddings:

I     → [1.0, 0.0, 1.0, 0.0]
am    → [0.0, 1.0, 0.0, 1.0]
Fine. → [1.0, 1.0, 1.0, 1.0]
  

Input Matrix X (3 × 4):

[
 [1.0, 0.0, 1.0, 0.0],  # I
 [0.0, 1.0, 0.0, 1.0],  # am
 [1.0, 1.0, 1.0, 1.0]   # Fine.
]
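
If you want to follow along in code, here is the start of a minimal NumPy sketch of this walkthrough (the embedding values are the assumed toy vectors above, not learned weights):

import numpy as np

# Assumed toy embeddings for the three tokens (Step 1), one row per token.
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # I
    [0.0, 1.0, 0.0, 1.0],  # am
    [1.0, 1.0, 1.0, 1.0],  # Fine.
])  # shape (3, 4): seq_len × d_model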
  

Step 2: Linear Projection to Q, K, V

For simplicity, W_Q, W_K, and W_V are all taken to be the 4 × 4 identity matrix, so each projection leaves X unchanged:

Q = K = V = [
 [1, 0, 1, 0],
 [0, 1, 0, 1],
 [1, 1, 1, 1]
]
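
Continuing the NumPy sketch, the identity projections simply copy X (in a trained model, W_Q, W_K, and W_V would be distinct learned matrices):

# Identity projections -- a simplifying assumption for this example only.
W_Q = W_K = W_V = np.eye(4)

Q = X @ W_Q
K = X @ W_K
V = X @ W_V  # Q, K, V each have shape (3, 4) and equal X here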
  

Step 3: Compute Attention Scores (Q × Kᵀ)

Dot-product scores (entry (i, j) is the dot product of token i's query with token j's key):

[
 [2, 0, 2],
 [0, 2, 2],
 [2, 2, 4]
]
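
In the running sketch this is a single matrix product:

# Dot-product similarity between every query (row) and every key (column).
scores = Q @ K.T  # shape (3, 3)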
  

Step 4: Scale the Scores

Divide by √dₖ = √4 = 2:

[
 [1.0, 0.0, 1.0],
 [0.0, 1.0, 1.0],
 [1.0, 1.0, 2.0]
]
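
Continuing the sketch:

# Scale by the square root of the key dimension (√4 = 2).
d_k = K.shape[-1]
scaled = scores / np.sqrt(d_k)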
  

Step 5: Apply Softmax

Attention weights (softmax applied to each row of the scaled scores):

[
 [0.422, 0.155, 0.422],
 [0.155, 0.422, 0.422],
 [0.212, 0.212, 0.576]
]
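
Continuing the sketch with a row-wise softmax (subtracting the row maximum first is a standard numerical-stability trick and does not change the result):

# Each row of A sums to 1: how strongly each token attends to every token.
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
A = exp / exp.sum(axis=-1, keepdims=True)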
  

Step 6: Multiply by Value Matrix V

Final attention output (each row is a weighted sum of the rows of V, rounded to three decimals):

[
 [0.845, 0.578, 0.845, 0.578],  # "I"
 [0.578, 0.845, 0.578, 0.845],  # "am"
 [0.788, 0.788, 0.788, 0.788]   # "Fine."
]
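
And the final line of the running sketch:

# Each output row is the attention-weighted average of the value vectors.
output = A @ V  # shape (3, 4)
print(np.round(output, 3))  # matches the matrix above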
  

Summary Table

Step                   Result                 Shape
Input Embeddings       X                      (3 × 4)
Linear Projections     Q, K, V                (3 × 4)
Dot Product Q × Kᵀ     Scores                 (3 × 3)
Scale (÷ √4)           Scaled scores          (3 × 3)
Softmax                Attention weights A    (3 × 3)
Multiply A × V         Output                 (3 × 4)
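
Putting the whole walkthrough together, here is a self-contained NumPy sketch (the toy embeddings and identity projections are the assumptions from Steps 1 and 2, not values from a real model):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q Kᵀ / √dₖ) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax with the usual max-subtraction for stability.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Assumed toy embeddings for "I am Fine." (not learned values).
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # I
    [0.0, 1.0, 0.0, 1.0],  # am
    [1.0, 1.0, 1.0, 1.0],  # Fine.
])

# With identity projections, Q = K = V = X.
output, weights = scaled_dot_product_attention(X, X, X)
print(np.round(weights, 3))  # Step 5 attention weights
print(np.round(output, 3))   # Step 6 output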