Transformer Attention Basics Simulator

Explore scaled dot-product attention, the heart of the Transformer, with just three tokens. Change each token's direction, the query token and the temperature (scaling) τ; the dot-product scores, softmax attention weights, output vector and attention entropy update in real time, so you can see how attention is formed.

Parameters
Token 1 direction θ₁ (°): direction of key₁ = value₁ (a 2-D unit vector)
Token 2 direction θ₂ (°): direction of key₂ = value₂ (a 2-D unit vector)
Token 3 direction θ₃ (°): direction of key₃ = value₃ (a 2-D unit vector)
Query (the attending side): uses the embedding of the chosen token as the query Q
Temperature (scaling) τ: equivalent to the √d_k scaling; smaller = sharper attention
Results
Attention weight w₁
Attention weight w₂
Attention weight w₃
Output vector angle (°)
Most-attended token
Attention entropy
Attention visualisation — query and three tokens

The arrows on the unit circle are the key directions of the three tokens. The query token is highlighted, and the thickness and opacity of the line from the query to each token represent the attention weight w_i. The orange arrow is the weighted sum of the values (the output vector).

Distribution of attention weights
Attention weight vs temperature τ
Theory & Key Formulas

$$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Scaled dot-product attention. The dot product of the query Q with each key K gives a score, which is divided by √d_k, normalised into weights by softmax, and used to take a weighted average of the values V.
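The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the names `softmax` and `attention` and the toy matrices `Q`, `K`, `V` are assumptions made for the example, and rows are represented as ordinary lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score of this query against every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))])
    return out

# Hypothetical toy data: two queries, three key/value pairs, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a convex combination of the value rows, so with unit-length values it can never be longer than 1.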

$$w_i=\frac{\exp(s_i/\tau)}{\sum_j \exp(s_j/\tau)},\quad s_i=Q\cdot K_i$$

Attention weight w_i and raw score s_i. τ is the temperature (scaling). With unit vectors, s_i=cos(θ_query−θ_i), so a token closer in direction gets a higher score.
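For the three-token setup, the scores and weights can be computed directly from the angles. The sketch below assumes the query is one of the three tokens and that keys are unit vectors, as in this tool; the function name `attention_weights` and the example angles 0°, 60°, 180° are illustrative choices, not values taken from the tool.

```python
import math

def attention_weights(thetas_deg, query_index, tau):
    """Return (scores, weights) with s_i = cos(theta_query - theta_i)
    and w_i = softmax(s_i / tau)."""
    theta_q = math.radians(thetas_deg[query_index])
    scores = [math.cos(theta_q - math.radians(t)) for t in thetas_deg]
    logits = [s / tau for s in scores]
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return scores, [e / total for e in exps]

# Assumed example: Token 1 is the query, tau = 1.
scores, weights = attention_weights([0.0, 60.0, 180.0], query_index=0, tau=1.0)
print([round(s, 3) for s in scores])    # [1.0, 0.5, -1.0]
print([round(w, 3) for w in weights])   # [0.574, 0.348, 0.078]
```

Setting all three angles to the same value makes every score equal to 1, so the weights come out as 1/3 each.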

$$H=-\sum_i w_i\ln w_i,\qquad \mathbf{o}=\sum_i w_i\,\mathbf{v}_i$$

Attention entropy H (0 = fully focused on one token, ln 3≈1.099 = uniform over three tokens) and the output vector o, the attention-weighted average of the values. The √d_k scaling keeps softmax in its sensitive range and helps avoid vanishing gradients.
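The two quantities above can be checked with a short sketch. The helper names `attention_entropy` and `output_vector` are assumptions for this example, and the values are taken to be unit vectors at the given angles, as in this tool.

```python
import math

def attention_entropy(weights):
    """H = -sum_i w_i ln w_i, treating 0 * ln 0 as 0."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

def output_vector(weights, thetas_deg):
    """o = sum_i w_i v_i with v_i = (cos theta_i, sin theta_i)."""
    x = sum(w * math.cos(math.radians(t)) for w, t in zip(weights, thetas_deg))
    y = sum(w * math.sin(math.radians(t)) for w, t in zip(weights, thetas_deg))
    return x, y

uniform = [1 / 3, 1 / 3, 1 / 3]
one_hot = [1.0, 0.0, 0.0]
print(round(attention_entropy(uniform), 4))   # 1.0986, i.e. ln 3

# Equal weight on tokens at 0 and 90 degrees gives an output at 45 degrees.
x, y = output_vector([0.5, 0.5, 0.0], [0.0, 90.0, 180.0])
angle = math.degrees(math.atan2(y, x))
```

The output is an average of unit vectors, so its length shrinks as the attended tokens point in more different directions.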

What is the Attention Mechanism?

🙋
Everyone keeps talking about the "attention" in the "Transformer" — but what does it actually do? Just the word alone doesn't give me a picture.
🎓
Roughly speaking, it is a mechanism that decides, with weights, which other words the current word should look at in a sentence. Take "He ate the apple, and it was red." When processing "it", linking strongly to "apple" makes the meaning connect. Attention computes "where to look and how much" as numbers, the attention weights. The tool on this page turns tokens into three arrows so you can see how those weights are formed.
🙋
I see. But how are the "weights" decided? It's not as if the computer just somehow knows "this part is important", right?
🎓
Good question. The key is the dot product of the "query" and the "key". Each token is represented by a vector that has a direction. Take the dot product of the attending query with each token's key, and the closer the directions, the larger the value. That is the "raw score". Align all the tokens in the same direction in the tool and you'll see every score become 1, so the weights are equal. Conversely, a token pointing exactly opposite the query gets the lowest possible score, −1, and receives the least weight; at low temperature it is almost ignored.
🙋
Once you have the scores, what comes next? You don't seem to use them as-is.
🎓
Right, the scores are not yet "weights". You pass them through a function called softmax, which turns them into probability-like values that sum to 1. Those are the attention weights w. And before that, you divide by the "temperature τ". Lowering τ exaggerates the differences between scores, so almost all the weight gathers on the single most similar token — "sharp attention". Raising τ smooths the differences out, moving toward equal weight across the three tokens — "diffuse attention". Move the temperature slider in the tool and watch the bar chart change shape.
🙋
The description of the temperature slider says "equivalent to √d_k". What is that square root there for?
🎓
A point that matters a lot in practice. In a real Transformer the vector dimension d_k is large, like 64 or 128. When the dimension is large, the dot product tends to grow in magnitude as well (its variance grows with d_k), and feeding such large scores into softmax pins the weights at 0 or 1. When that happens the gradient becomes nearly zero and learning stalls. So the dot product is divided by √d_k to bring the score magnitude back into a reasonable range. The "temperature" in this tool is exactly that division factor, moved by hand.
🙋
Last thing — what do the "output vector" and "attention entropy" at the end mean?
🎓
The output vector is "the result of mixing the values by the attention weights". It is a weighted-average vector pulled toward the direction of the high-weight tokens. Attention entropy is an index of how spread out the weights are — 0 when focused on one token, and the maximum ln3≈1.10 when split evenly across three. In real Transformer analysis people also measure per-head entropy to classify "narrowly focused" versus "broadly averaging" heads. Raise and lower the temperature in the tool and you'll see the entropy and the output direction move together.
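The way entropy and output direction move together with temperature can be reproduced offline. The sketch below is an approximation of what the tool displays, under stated assumptions: the function name `simulate`, the example angles and the list of temperatures are all chosen for illustration.

```python
import math

def simulate(thetas_deg, query_index, tau):
    """Return (weights, entropy, output angle in degrees) for one setting."""
    q = math.radians(thetas_deg[query_index])
    logits = [math.cos(q - math.radians(t)) / tau for t in thetas_deg]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    # Values equal keys here: unit vectors at the given angles.
    x = sum(w * math.cos(math.radians(t)) for w, t in zip(weights, thetas_deg))
    y = sum(w * math.sin(math.radians(t)) for w, t in zip(weights, thetas_deg))
    return weights, entropy, math.degrees(math.atan2(y, x))

thetas = [0.0, 60.0, 180.0]   # assumed example angles; Token 1 is the query
for tau in (0.1, 0.5, 1.0, 2.0):
    w, h, angle = simulate(thetas, 0, tau)
    print(f"tau={tau:<4} H={h:.3f} angle={angle:5.1f} w={[round(v, 3) for v in w]}")
```

For this configuration the entropy rises with τ, and the output direction drifts away from the query token as more weight flows to the others.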

Frequently Asked Questions

What is scaled dot-product attention?
Scaled dot-product attention is the central operation of the Transformer, written Attention(Q,K,V)=softmax(QKᵀ/√d_k)·V. The dot product of a query Q with each key K produces a score for 'how much to attend to that token'; the scores are divided by √d_k, normalised by softmax into weights between 0 and 1 that sum to 1, and used to take a weighted average of the values V. This tool uses three 2-D unit vectors so you can follow that whole pipeline with your eyes.

Why are the scores divided by √d_k?
When the dimension d_k is large, the dot product of the query and a key has a variance that grows with d_k, so the values fed into softmax become very large in magnitude. Softmax then enters its saturated region near 0 or 1, the gradient vanishes and training stalls. Dividing the dot product by √d_k keeps the variance of the scores close to 1 (assuming roughly unit-variance components), so softmax works in its sensitive range and gradient vanishing is avoided. The 'temperature' slider in this tool plays the role of that scaling factor: smaller means sharper attention, larger means more uniform.

What is attention entropy?
Attention entropy measures how spread out the weight distribution w is, computed as H = -Σ w_i·ln(w_i). When the weight is concentrated on one token, H is close to 0 (sharp attention); when it is split evenly across three tokens (1/3 each), H reaches its maximum of ln(3)≈1.0986 (diffuse attention). Lowering the temperature reduces H, raising it increases H. In real Transformer analysis, the per-head entropy is often used to classify heads as 'local' or 'broadly averaging'.

What do the query, key and value represent?
The query Q represents 'what am I looking for right now', the key K is a label of 'what each token can offer', and the value V is the actual content that gets retrieved. The dot product of Q and K measures relevance (the attention score), and the resulting weights mix the values V into the output. As a library analogy: Q is the search keyword, K is the label on the book spine, and V is the content inside the book. For simplicity this tool makes key and value the same unit vector, but in a real Transformer Q, K and V are each produced from the embedding by separate weight matrices.
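The variance argument behind the √d_k scaling can be checked numerically. The sketch below rests on an assumption that is standard for this argument but not stated by the tool: the components of the query and key are independent draws with zero mean and unit variance. The function name `dot_product_variance`, the trial count and the seed are illustrative.

```python
import math
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate Var(q . k) and Var(q . k / sqrt(d_k)) for random q, k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(raw) / trials
    var_raw = sum((r - mean) ** 2 for r in raw) / trials
    # Dividing the score by sqrt(d_k) divides its variance by d_k.
    return var_raw, var_raw / d_k

for d_k in (4, 64, 256):
    var_raw, var_scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:<4} Var(q.k)={var_raw:8.2f}  Var(q.k/sqrt(d_k))={var_scaled:.2f}")
```

Under these assumptions the raw variance comes out close to d_k, while the scaled variance stays near 1 regardless of dimension.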

Real-World Applications

Large language models (LLMs): Conversational AIs such as GPT and Claude are built by stacking many Transformer blocks. The self-attention in each layer learns "which words in the sentence to bind together", capturing the referent of a pronoun or the dependency between distant words. The behaviour you see in this tool — "the query attends to the token closest in direction" — is essentially the same even in a huge model.

Machine translation and summarisation: Attention originally emerged in neural machine translation to align "which part of the source the word being generated corresponds to". Visualising the attention weights of a translation reveals an "alignment" in which source and target words line up almost along the diagonal. In summarisation models too, attention lets you trace which part of the input an output sentence is grounded in.

Image recognition (Vision Transformer): In a Vision Transformer, which splits an image into patches treated as tokens, attention learns "which regions of the image are related". Looking at a heat map of the attention weights, object outlines and semantically coherent regions emerge, showing a wide-range context integration different from a CNN.

Model interpretation and visualisation tools: In research and practice, tools such as BertViz visualise attention weights to investigate what a model bases its predictions on. This simulator carves out the smallest unit of that — how scores and softmax move between one query and a few keys — and helps build the intuition you need before reading visualisation results.

Common Misconceptions and Pitfalls

The biggest misconception is the assumption that "attention weights show the model's reasoning itself". A token with a high attention weight does indeed influence the output more easily, but the final output is determined by the content of the weighted values and the non-linear transforms in later layers as well. Jumping to "high weight = that is the reason" leads to wrong explanations. Attention is a clue for interpretation, but it is not equivalent to a causal explanation. Treat visualisations strictly as a tool for generating hypotheses.

Next, the misconception that "lower temperature (scaling) is always better". Lowering the temperature sharpens the attention and concentrates the weight on the single most relevant token. That looks desirable at first, but attention that is too sharp saturates softmax, the gradient vanishes and training stops. And if it only ever looks at one token, it loses the ability to integrate multiple cues. A real Transformer uses the "moderate scale" of √d_k precisely to balance sharpness against smoothness. Try setting the temperature extremely small and extremely large in this tool and verify the drawbacks at both extremes.
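The saturation effect described above can be made concrete by looking at the softmax derivative, ∂w_i/∂s_j = w_i(δ_ij − w_j)/τ. The sketch below is an illustration under assumed example scores (the cosines of 0°, 60° and 180°); the function names are hypothetical.

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax of scores / tau, computed stably."""
    logits = [s / tau for s in scores]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_abs_gradient(scores, tau):
    """Largest |d w_i / d s_j|, using d w_i / d s_j = w_i (delta_ij - w_j) / tau."""
    w = softmax_with_temperature(scores, tau)
    n = len(w)
    return max(
        abs(w[i] * ((1.0 if i == j else 0.0) - w[j]) / tau)
        for i in range(n)
        for j in range(n)
    )

scores = [1.0, 0.5, -1.0]   # assumed example: cos of 0, 60 and 180 degrees
for tau in (0.01, 0.1, 1.0, 10.0):
    w = softmax_with_temperature(scores, tau)
    print(f"tau={tau:<5} max|dw/ds|={max_abs_gradient(scores, tau):.3e} "
          f"w={[round(v, 3) for v in w]}")
```

At very small τ the weights are effectively one-hot and the derivative collapses toward zero; at very large τ the weights are nearly uniform, so the scores barely affect the output.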

Finally, do not think that "the 2-D, three-token setup of this tool is the same as a real Transformer". In an actual model the embedding dimension is in the hundreds, the number of tokens in the thousands, multiple attention heads run in parallel, and Q, K and V are each made by separate trained weight matrices. This tool is a minimal model that fixes key=value to the same unit vector, and is only meant to let you experience the skeleton "dot product → scale → softmax → weighted sum". Once you have grasped the skeleton, move on to the multi-dimensional, multi-head and positional-encoding elements of a real model.
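To illustrate the difference between this tool and a real model, the sketch below adds separate projection matrices for Q, K and V to the same skeleton. Everything here is hypothetical: the matrices are hand-picked 2×2 examples rather than trained weights, and a single head with √d_k scaling is assumed. With identity projections the code reduces to the tool's key = value setup.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention with separate Q, K and V projections."""
    Q = [matvec(W_q, x) for x in X]
    K = [matvec(W_k, x) for x in X]
    V = [matvec(W_v, x) for x in X]
    d_k = len(K[0])
    outputs = []
    for q in Q:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        w = [e / total for e in exps]
        outputs.append([sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs

def unit(deg):
    """2-D unit vector at the given angle in degrees."""
    return [math.cos(math.radians(deg)), math.sin(math.radians(deg))]

X = [unit(0.0), unit(60.0), unit(180.0)]   # assumed example embeddings
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
ROTATE_90 = [[0.0, -1.0], [1.0, 0.0]]       # hypothetical W_V, not a trained matrix

plain = self_attention(X, IDENTITY, IDENTITY, IDENTITY)
rotated = self_attention(X, IDENTITY, IDENTITY, ROTATE_90)
print(plain[0], rotated[0])
```

Because only the value projection changes, the attention weights are identical in both runs and each output is simply rotated by 90°, which shows that K decides where to look while V decides what is retrieved.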

How to Use

  1. Set the direction angles θ₁, θ₂ and θ₃ (in degrees) for Token 1, Token 2 and Token 3. Each angle defines a 2-D unit vector that serves as both the key and the value of that token.
  2. Choose which token acts as the query, then adjust the temperature τ (0.1 to 2.0) to control attention sharpness: lower values create peaked distributions favouring a single token, higher values flatten attention across all tokens.
  3. Observe the attention weights w₁, w₂, w₃, computed in real time as softmax(s_i/τ) with s_i = Q·K_i, together with the output vector angle in degrees, the most-attended token and the attention entropy.

Worked Example

Set θ₁=0°, θ₂=60°, θ₃=180°, choose Token 1 as the query and set τ=1.0. Raw dot products: s₁=cos 0°=1.0, s₂=cos 60°=0.5, s₃=cos 180°=−1.0. Softmax yields approximately w₁=0.574, w₂=0.348, w₃=0.078. The output vector is about (0.67, 0.30), so the output angle is ≈24°, pulled from Token 1 toward Token 2. Raising the temperature to 2.0 softens attention to w₁=0.466, w₂=0.363, w₃=0.171, spreading focus and rotating the output to ≈33°. Entropy increases from 0.88 to 1.03 nats.

Practical Notes

  1. A temperature of 0.1 approximates hard attention, with nearly all weight on the single closest token. A standard Transformer does not tune this value by hand; it fixes the scaling at √d_k, which helps keep softmax out of saturation during training.
  2. Tokens more than 90° away from the query get negative scores and always receive less weight than tokens with positive scores. Two tokens at the same angular distance from the query, one on either side, get identical scores and therefore identical weights.
  3. Monitor the attention entropy: values below about 0.5 indicate single-token dominance, while values near the maximum ln 3≈1.099 indicate almost no selectivity. With three tokens the entropy cannot exceed ln 3.