"I am Fine."
We will walk through scaled dot-product attention as used in the Transformer architecture (Vaswani et al., 2017).
| Element | Value |
|---|---|
| Sentence | "I am Fine." (3 tokens) |
| Vocabulary size | 10,000 |
| Embedding size (d_model) | 4 |
| Number of tokens (seq_len) | 3 |
| Attention head dimension (d_k) | 4 |
Tokens: [I, am, Fine.]
Embedding matrix E (shape: 10,000 × 4). Assumed embeddings for the three tokens:

I     → [1.0, 0.0, 1.0, 0.0]
am    → [0.0, 1.0, 0.0, 1.0]
Fine. → [1.0, 1.0, 1.0, 1.0]
Input matrix X (3 × 4):

[ [1.0, 0.0, 1.0, 0.0],   # I
  [0.0, 1.0, 0.0, 1.0],   # am
  [1.0, 1.0, 1.0, 1.0] ]  # Fine.
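As a minimal sketch of this setup (assuming NumPy; the dictionary and variable names below are illustrative, not part of the original walkthrough), X can be assembled by stacking the three token embeddings:

```python
import numpy as np

# Toy lookup table holding only the three embeddings used in this example;
# a real model would store a learned 10,000 x 4 embedding matrix E.
embeddings = {
    "I":     np.array([1.0, 0.0, 1.0, 0.0]),
    "am":    np.array([0.0, 1.0, 0.0, 1.0]),
    "Fine.": np.array([1.0, 1.0, 1.0, 1.0]),
}

tokens = ["I", "am", "Fine."]
X = np.stack([embeddings[t] for t in tokens])  # shape (3, 4)
```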
Using identity matrices for W_Q, W_K, and W_V:

Q = K = V =
[ [1, 0, 1, 0],
  [0, 1, 0, 1],
  [1, 1, 1, 1] ]
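Continuing the sketch, the projections with identity weight matrices simply copy X (names carried over from the code above):

```python
d_model = 4
W_Q = W_K = W_V = np.eye(d_model)  # identity weights, as in this walkthrough

Q = X @ W_Q   # (3, 4)
K = X @ W_K   # (3, 4)
V = X @ W_V   # (3, 4)
# With identity weights, Q == K == V == X.
```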
Dot-product scores Q·Kᵀ (3 × 3):

[ [2, 0, 2],
  [0, 2, 2],
  [2, 2, 4] ]
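Each score is the dot product of one query row with one key row; in the sketch:

```python
scores = Q @ K.T   # (3, 3)
# array([[2., 0., 2.],
#        [0., 2., 2.],
#        [2., 2., 4.]])
```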
Divide by √d_k = √4 = 2:

[ [1.0, 0.0, 1.0],
  [0.0, 1.0, 1.0],
  [1.0, 1.0, 2.0] ]
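Scaling by √d_k (here √4 = 2) keeps the scores from growing with the head dimension:

```python
d_k = Q.shape[-1]               # 4
scaled = scores / np.sqrt(d_k)  # every score divided by 2
# array([[1., 0., 1.],
#        [0., 1., 1.],
#        [1., 1., 2.]])
```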
Attention weights A (row-wise softmax of the scaled scores, rounded to three decimals):

[ [0.422, 0.155, 0.422],
  [0.155, 0.422, 0.422],
  [0.212, 0.212, 0.576] ]
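The weights come from a softmax applied to each row of the scaled scores; a numerically stable sketch (the helper name is illustrative):

```python
def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(scaled)   # (3, 3); each row sums to 1
# A.round(3) ->
# array([[0.422, 0.155, 0.422],
#        [0.155, 0.422, 0.422],
#        [0.212, 0.212, 0.576]])
```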
Final attention output A·V (rounded to three decimals):

[ [0.845, 0.578, 0.845, 0.578],   # "I"
  [0.578, 0.845, 0.578, 0.845],   # "am"
  [0.788, 0.788, 0.788, 0.788] ]  # "Fine."
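Each output row is a weighted average of the value vectors, using the attention weights as the mixing coefficients:

```python
output = A @ V   # (3, 4)
# output.round(3) ->
# array([[0.845, 0.578, 0.845, 0.578],   # "I"
#        [0.578, 0.845, 0.578, 0.845],   # "am"
#        [0.788, 0.788, 0.788, 0.788]])  # "Fine."
```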
| Step | Result | Shape |
|---|---|---|
| Input embeddings | X | (3 × 4) |
| Linear projections | Q, K, V | (3 × 4) |
| Dot product Q × Kᵀ | Scores | (3 × 3) |
| Scale by √4 | Scaled scores | (3 × 3) |
| Softmax | Attention weights A | (3 × 3) |
| Multiply A × V | Output | (3 × 4) |
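Putting all the steps together, a compact end-to-end sketch of scaled dot-product attention (NumPy; function and variable names are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as described in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (seq_len, d_k)

X = np.array([[1.0, 0.0, 1.0, 0.0],   # I
              [0.0, 1.0, 0.0, 1.0],   # am
              [1.0, 1.0, 1.0, 1.0]])  # Fine.

print(scaled_dot_product_attention(X, X, X).round(3))
```

Running this with Q = K = V = X prints the same 3 × 4 output matrix as the walkthrough above.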