Attention Mechanism in the Transformer

Input Sentence:

"I am Fine."

We will walk through scaled dot-product attention as used in the Transformer architecture (Vaswani et al., 2017).
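
The full computation we will trace by hand is:

Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V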

Step 0: Preliminaries and Assumptions

Element                          Value
Sentence                         "I am Fine." (3 tokens)
Vocabulary size                  10,000
Embedding size (d_model)         4
Number of tokens (seq_len)       3
Attention head dimension (d_k)   4

Step 1: Tokenization and Input Embeddings

Tokens: [I, am, Fine.] (for simplicity, "Fine." is treated as a single token)

Embedding matrix E shape: 10,000 × 4 (vocabulary size × d_model)

Assumed embeddings:

I     → [1.0, 0.0, 1.0, 0.0]
am    → [0.0, 1.0, 0.0, 1.0]
Fine. → [1.0, 1.0, 1.0, 1.0]
  

Input Matrix X (3 × 4):

[
 [1.0, 0.0, 1.0, 0.0],  # I
 [0.0, 1.0, 0.0, 1.0],  # am
 [1.0, 1.0, 1.0, 1.0]   # Fine.
]
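
If you want to follow along in code, here is the start of a minimal NumPy sketch of this walkthrough (the embedding values are the assumed toy vectors above, not learned weights):

import numpy as np

# Assumed toy embeddings for the three tokens (Step 1), one row per token.
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # I
    [0.0, 1.0, 0.0, 1.0],  # am
    [1.0, 1.0, 1.0, 1.0],  # Fine.
])  # shape (3, 4): seq_len × d_model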
  

Step 2: Linear Projection to Q, K, V

For simplicity, W_Q, W_K, and W_V are all taken to be the 4 × 4 identity matrix, so each projection leaves X unchanged:

Q = K = V = [
 [1, 0, 1, 0],
 [0, 1, 0, 1],
 [1, 1, 1, 1]
]
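
Continuing the NumPy sketch, the identity projections simply copy X (in a trained model, W_Q, W_K, and W_V would be distinct learned matrices):

# Identity projections -- a simplifying assumption for this example only.
W_Q = W_K = W_V = np.eye(4)

Q = X @ W_Q
K = X @ W_K
V = X @ W_V  # Q, K, V each have shape (3, 4) and equal X here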
  

Step 3: Compute Attention Scores (Q × Kᵀ)

Dot-product scores (entry (i, j) is the dot product of token i's query with token j's key):

[
 [2, 0, 2],
 [0, 2, 2],
 [2, 2, 4]
]
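
In the running sketch this is a single matrix product:

# Dot-product similarity between every query (row) and every key (column).
scores = Q @ K.T  # shape (3, 3)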
  

Step 4: Scale the Scores

Divide by √dₖ = √4 = 2:

[
 [1.0, 0.0, 1.0],
 [0.0, 1.0, 1.0],
 [1.0, 1.0, 2.0]
]
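
Continuing the sketch:

# Scale by the square root of the key dimension (√4 = 2).
d_k = K.shape[-1]
scaled = scores / np.sqrt(d_k)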
  

Step 5: Apply Softmax

Attention weights (softmax applied to each row of the scaled scores):

[
 [0.422, 0.155, 0.422],
 [0.155, 0.422, 0.422],
 [0.212, 0.212, 0.576]
]
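
Continuing the sketch with a row-wise softmax (subtracting the row maximum first is a standard numerical-stability trick and does not change the result):

# Each row of A sums to 1: how strongly each token attends to every token.
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
A = exp / exp.sum(axis=-1, keepdims=True)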
  

Step 6: Multiply by Value Matrix V

Final attention output (each row is a weighted sum of the rows of V, rounded to three decimals):

[
 [0.845, 0.578, 0.845, 0.578],  # "I"
 [0.578, 0.845, 0.578, 0.845],  # "am"
 [0.788, 0.788, 0.788, 0.788]   # "Fine."
]
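
And the final line of the running sketch:

# Each output row is the attention-weighted average of the value vectors.
output = A @ V  # shape (3, 4)
print(np.round(output, 3))  # matches the matrix above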
  

Summary Table

Step                   Result                 Shape
Input Embeddings       X                      (3 × 4)
Linear Projections     Q, K, V                (3 × 4)
Dot Product Q × Kᵀ     Scores                 (3 × 3)
Scale (÷ √4)           Scaled scores          (3 × 3)
Softmax                Attention weights A    (3 × 3)
Multiply A × V         Output                 (3 × 4)
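
Putting the whole walkthrough together, here is a self-contained NumPy sketch (the toy embeddings and identity projections are the assumptions from Steps 1 and 2, not values from a real model):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q Kᵀ / √dₖ) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax with the usual max-subtraction for stability.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = exp / exp.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Assumed toy embeddings for "I am Fine." (not learned values).
X = np.array([
    [1.0, 0.0, 1.0, 0.0],  # I
    [0.0, 1.0, 0.0, 1.0],  # am
    [1.0, 1.0, 1.0, 1.0],  # Fine.
])

# With identity projections, Q = K = V = X.
output, weights = scaled_dot_product_attention(X, X, X)
print(np.round(weights, 3))  # Step 5 attention weights
print(np.round(output, 3))   # Step 6 output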