The arrows on the unit circle are the key directions of the three tokens. The query token is highlighted, and the thickness and opacity of the line from the query to each token represent the attention weight w_i. The orange arrow is the weighted sum of the values (the output vector).
$$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
Scaled dot-product attention. The dot product of the query Q with each key K gives a score, which is divided by √d_k, normalised into weights by softmax, and used to take a weighted average of the values V.
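The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `softmax` and `attention` and the toy matrices are assumptions made for the example, and matrices are represented as lists of rows.

```python
import math

def softmax(xs):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors (illustrative)."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Score of this query against every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors, one component at a time.
        outputs.append([sum(w * v[c] for w, v in zip(weights, V))
                        for c in range(len(V[0]))])
    return outputs

# Toy data (hypothetical): one query, two keys, identity values,
# so the output row equals the attention weights themselves.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
```

Because the values here form an identity matrix, `out[0]` is simply the weight vector: it sums to 1 and puts more mass on the key aligned with the query.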
$$w_i=\frac{\exp(s_i/\tau)}{\sum_j \exp(s_j/\tau)},\quad s_i=Q\cdot K_i$$
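Attention weight w_i and raw score s_i. τ is the temperature, which plays the same role as √d_k in the attention formula: a smaller τ sharpens the weights, a larger τ flattens them towards uniform. With unit vectors, s_i=cos(θ_query−θ_i), so a token whose key points closer to the query direction gets a higher score.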
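As a rough sketch of how the weights respond to direction and temperature, the snippet below computes s_i and w_i for unit vectors given by their angles. The function name `attention_weights`, the three key angles, and the query angle are all assumptions chosen for illustration.

```python
import math

def attention_weights(theta_query, thetas, tau=1.0):
    """Return (scores, weights) for unit vectors specified by angle in radians."""
    # For unit vectors the dot product is the cosine of the angle between them.
    scores = [math.cos(theta_query - t) for t in thetas]
    # Temperature-scaled softmax; subtracting the max keeps exp() well behaved.
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return scores, [e / total for e in exps]

# Hypothetical setup: three keys spaced evenly on the circle, query near the first.
key_angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
query_angle = 0.3

scores, w_warm = attention_weights(query_angle, key_angles, tau=1.0)
_, w_cold = attention_weights(query_angle, key_angles, tau=0.2)
```

With these inputs the first key has the highest score, and lowering τ from 1.0 to 0.2 concentrates more of the weight on it.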
$$H=-\sum_i w_i\ln w_i,\qquad \mathbf{o}=\sum_i w_i\,\mathbf{v}_i$$
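Attention entropy H (0 = fully focused on one token, ln 3 ≈ 1.099 = uniform over the three tokens) and the output vector o. The √d_k scaling keeps the scores in a range where softmax is still sensitive to their differences; without it, large scores can saturate softmax and leave very small gradients.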
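The two quantities above are straightforward to compute. The sketch below is illustrative only: the names `entropy` and `output_vector` and the example value vectors are assumptions, and terms with w_i = 0 are skipped using the usual convention that 0 ln 0 = 0.

```python
import math

def entropy(weights):
    # H = -sum_i w_i ln w_i, skipping zero weights (0 * ln 0 is taken as 0).
    return -sum(w * math.log(w) for w in weights if w > 0)

def output_vector(weights, values):
    # o = sum_i w_i v_i, computed component by component.
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]

# Hypothetical 2-D value vectors for three tokens.
values = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

h_uniform = entropy([1 / 3, 1 / 3, 1 / 3])   # equals ln 3
h_focused = entropy([1.0, 0.0, 0.0])         # equals 0
o = output_vector([0.5, 0.5, 0.0], values)   # midpoint of the first two values
```

Uniform weights over three tokens give the maximum entropy ln 3, while a one-hot weight vector gives zero, matching the two endpoints quoted in the caption.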