The transformer architecture, introduced by Vaswani et al. in Attention Is All You Need (2017), displaced recurrent networks as the dominant paradigm in sequence modelling. Its defining contribution was not any single idea, but a particular composition of ideas crystallised into one equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

This post builds that equation from first principles. We start with the abstract problem of soft information retrieval, introduce the geometric language of word vectors, derive each component in sequence, and close with a survey of the modern extensions that have made attention even more powerful — and more efficient — since 2017.

Prerequisites. Comfortable with vectors, matrix multiplication, and the chain rule. No prior knowledge of transformers required.

A Brief History

Attention did not originate with the transformer. Bahdanau et al. (2014) introduced the first attention mechanism for neural machine translation, allowing a decoder RNN to selectively focus on parts of an encoded source sequence rather than compressing everything into a single fixed-size vector. Luong et al. (2015) proposed several simplified variants. What Vaswani et al. contributed was the insight that you could remove the recurrence entirely and build a model composed purely of attention and feed-forward layers — the Transformer — that parallelises across the full sequence length and scales dramatically better with data and compute.

Today, scaled dot-product attention is the computational primitive underlying GPT-4, Claude, Gemini, LLaMA, and virtually every frontier model. Understanding it is not optional for anyone working in ML.

The Core Problem: Key-Value Retrieval

A key-value (KV) lookup takes three inputs — a set of $n$ keys $k_1, \ldots, k_n$, a corresponding set of $n$ values $v_1, \ldots, v_n$, and a query $q$ — and returns some value based on how well the query matches each key.

The familiar dictionary is the discrete version: the query must exactly match a key, and you get back exactly one value.

d = {"apple": 10, "banana": 5, "chair": 2}
d["apple"]  # → 10  (exact match)

The problem: what if the query is "fruit"? There is no exact match. Both "apple" and "banana" are partial matches; "chair" is irrelevant. A discrete lookup is forced to either fail or pick one arbitrarily.

Soft Retrieval

Instead of choosing, we blend. Assign a weight $\alpha_i \in [0,1]$ to each key measuring how relevant it is to the query, then return the weighted sum of values:

$$\text{output} = \sum_{i=1}^{n} \alpha_i\, v_i$$

where $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$. These weights are the attention scores. For the "fruit" query, we might set $\alpha_{\text{apple}} = 0.6$, $\alpha_{\text{banana}} = 0.4$, $\alpha_{\text{chair}} = 0.0$, giving output $0.6 \times 10 + 0.4 \times 5 + 0 \times 2 = 8$.
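As a minimal sketch, here is that soft lookup in numpy, with the weights simply set by hand (the rest of this post is about how to compute them):

import numpy as np

# Hand-picked relevance weights for the query "fruit"; in a real model these
# are computed from the query and keys, as derived below.
values  = np.array([10, 5, 2])           # apple, banana, chair
weights = np.array([0.6, 0.4, 0.0])      # non-negative, sums to 1

output = weights @ values                # 0.6*10 + 0.4*5 + 0.0*2
print(output)                            # 8.0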

The question is: where do the attention scores come from?

Word Vectors and the Geometry of Meaning

Represent each word as a real-valued vector $\mathbf{x} \in \mathbb{R}^{d_k}$. Ideally, vectors whose words are semantically similar should be geometrically close.

[Figure: Word Embedding Space. A 2-D embedding (axes d₁, d₂) showing fruit (apple, banana, mango), vegetable (carrot, broccoli, spinach), and furniture (chair, table, desk) clusters, with the query "fruit".]
Words with similar meanings cluster together. The query "fruit" sits between the fruit cluster (high attention) and away from furniture (near-zero attention). Attention is a soft, differentiable version of nearest-neighbour retrieval.

These representations are not hand-crafted. In practice they emerge from self-supervised objectives: Word2Vec (Mikolov et al., 2013) trained embeddings by predicting surrounding words; modern transformers learn contextualised representations end-to-end as part of the full model. The key insight is that geometry encodes semantics — linear structure holds approximately:

$$\mathbf{v}_{\text{queen}} - \mathbf{v}_{\text{woman}} + \mathbf{v}_{\text{man}} \approx \mathbf{v}_{\text{king}}$$

Similarity via Dot Product

Given vectors for our query and each key, we need a scalar similarity measure. The dot product is the canonical choice:

$$\mathbf{q} \cdot \mathbf{k}_i = \sum_{j=1}^{d_k} q_j k_{ij} = \|\mathbf{q}\|\,\|\mathbf{k}_i\|\cos\theta_{qi}$$

Its sign encodes alignment: positive when vectors point in the same direction (similar), zero when orthogonal (unrelated), negative when opposing (dissimilar). Unlike cosine similarity, the dot product preserves magnitude, which carries useful information in learned representations — a word vector with a larger norm may represent a more salient concept.

Why not cosine similarity? Cosine normalises out the magnitude, treating all vectors as lying on the unit hypersphere. In a deep learning setting the magnitude of an activation is a signal — it conveys confidence or salience. Discarding it is an inductive bias we generally do not want to impose. With proper regularisation (LayerNorm, weight decay), the dot product behaves well without the normalisation step.

Computing the dot product between a single query and all $n$ keys at once is a matrix-vector product. Stacking the key vectors row-wise into a matrix $K \in \mathbb{R}^{n \times d_k}$:

$$\mathbf{s} = K\mathbf{q}^\top \in \mathbb{R}^n, \qquad s_i = \mathbf{q} \cdot \mathbf{k}_i$$
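A quick numerical sketch, with toy 2-D vectors chosen purely for illustration:

import numpy as np

K = np.array([[ 1.0, 0.2],    # key for "apple"
              [ 0.9, 0.3],    # key for "banana"
              [-0.8, 0.5]])   # key for "chair"
q = np.array([1.0, 0.1])      # query "fruit"

s = K @ q                     # one dot product per key
print(s)                      # [ 1.02  0.93 -0.75]: fruit-like keys score highest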

From Similarity Scores to Probabilities: Softmax

The raw scores $s_i$ live in $(-\infty, +\infty)$. We need to convert them to a valid probability distribution over the keys. The softmax function does exactly this:

$$\text{softmax}(\mathbf{s})_i = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}$$
[Figure: Raw Dot Products → Attention Weights. Raw scores q · kᵢ of 4.2 (apple), 3.8 (banana), 0.6 (chair) and -0.9 (desk) map under softmax to attention weights 0.59, 0.39, 0.02 and 0.00: each weight lies between 0 and 1, and the weights sum to 1.]
Softmax maps arbitrary real-valued scores to a probability simplex. The relative ordering is preserved (apple still wins), but every score is now a valid probability. Negative raw scores become small but non-zero weights.

Softmax is monotone — the ordering of scores is preserved. It is also differentiable everywhere, allowing gradients to flow back into the query and key vectors during training. Crucially, the exponential amplifies differences: a score that is slightly higher than its neighbours receives disproportionately more weight, giving attention a natural tendency towards sparsity.
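To make the mapping concrete, here is the figure's example reproduced in a few lines of numpy (the raw scores are the illustrative ones from the figure):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())                 # subtract max for numerical stability
    return e / e.sum()

scores = np.array([4.2, 3.8, 0.6, -0.9])    # apple, banana, chair, desk
print(softmax(scores).round(2))             # [0.59 0.39 0.02 0.  ]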

Entropy and Sharpness

The Shannon entropy of the attention distribution quantifies how spread out the focus is:

$$H(\boldsymbol{\alpha}) = -\sum_{i=1}^n \alpha_i \log_2 \alpha_i$$

Low entropy (near 0 bits) means the model is attending sharply to one key. Maximum entropy ($\log_2 n$ bits) means uniform attention — the model does not know which key to focus on. During training, you can observe the average attention entropy as a proxy for how decisive the model has become.
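A small helper makes this easy to monitor (taking 0 · log 0 = 0, as is standard):

import numpy as np

def attention_entropy(alpha: np.ndarray) -> float:
    """Shannon entropy (in bits) of one attention distribution."""
    alpha = alpha[alpha > 0]                  # treat 0 * log(0) as 0
    return float(-(alpha * np.log2(alpha)).sum())

print(attention_entropy(np.array([0.59, 0.39, 0.02, 0.0])))   # ~1.09 bits, fairly sharp
print(attention_entropy(np.full(4, 0.25)))                    # 2.0 bits, uniform over 4 keys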

The Scaling Factor: Why 1/√dₖ?

Suppose query and key components are i.i.d. random variables with mean 0 and variance 1. The dot product $\mathbf{q} \cdot \mathbf{k} = \sum_{j=1}^{d_k} q_j k_j$ is a sum of $d_k$ such terms. Because the terms are independent, their means and variances add (and, by the central limit theorem, the dot product is approximately Gaussian):

$$\mathbb{E}[\mathbf{q}\cdot\mathbf{k}] = 0, \qquad \mathrm{Var}[\mathbf{q}\cdot\mathbf{k}] = d_k$$

As $d_k$ grows, the standard deviation of the dot products grows as $\sqrt{d_k}$, pushing softmax into saturation — one score dominates and all gradients vanish. Dividing by $\sqrt{d_k}$ restores unit variance:

$$\mathrm{Var}\!\left[\frac{\mathbf{q}\cdot\mathbf{k}}{\sqrt{d_k}}\right] = 1$$
[Figure: Effect of the 1/√dₖ scaling factor (dₖ = 64). Without scaling, softmax([8, 7, 3, 1]) gives weights 0.73, 0.27, 0.00, 0.00 (entropy 0.89 bits); with scaling, softmax([8, 7, 3, 1] / √64) gives 0.35, 0.31, 0.19, 0.15 (entropy 1.92 bits).]
Without scaling, dot products in high-dimensional spaces tend to have large magnitudes, making softmax near-deterministic (all weight on one key). This saturates the softmax and causes gradients to vanish. Dividing by √dₖ keeps the pre-softmax variance at ~1, regardless of dimension.
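You can check the variance argument empirically with a quick simulation (random Gaussian vectors, nothing model-specific):

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))      # 10k random queries, unit-variance components
k = rng.standard_normal((10_000, d_k))      # 10k random keys

dots = (q * k).sum(axis=-1)                 # one dot product per pair
print(dots.var())                           # ~64: grows with d_k
print((dots / np.sqrt(d_k)).var())          # ~1: restored by the scaling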

The attention weight for the $i$-th key is therefore:

$$\alpha_i = \text{softmax}\!\left(\frac{\mathbf{s}}{\sqrt{d_k}}\right)_i = \frac{\exp(\mathbf{q}\cdot\mathbf{k}_i/\sqrt{d_k})}{\sum_{j=1}^n \exp(\mathbf{q}\cdot\mathbf{k}_j/\sqrt{d_k})}$$

Values as Vectors

So far, values have been scalars. In practice each value is a vector $\mathbf{v}_i \in \mathbb{R}^{d_v}$, stacked row-wise into $V \in \mathbb{R}^{n \times d_v}$. The output of attention is then a weighted sum of value vectors:

$$\text{out} = \sum_{i=1}^n \alpha_i\, \mathbf{v}_i = \boldsymbol{\alpha}^\top V \in \mathbb{R}^{d_v}$$

The output is a vector in the value space — a blend of all value vectors weighted by how relevant their keys were to the query. This is the core computation.
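Here is the full single-query pipeline so far in one short sketch, again with toy numbers:

import numpy as np

def single_query_attention(q, K, V):
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)             # similarity of q to every key, scaled
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()               # softmax -> attention weights
    return alpha @ V                          # weighted blend of value vectors

q = np.array([1.0, 0.1])
K = np.array([[1.0, 0.2], [0.9, 0.3], [-0.8, 0.5]])
V = np.array([[10., 0.], [5., 1.], [2., 7.]])       # one value vector per key
print(single_query_attention(q, K, V))              # a blend dominated by the fruit values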

Batching Over Queries: Matrix Form

In a transformer, every position in the sequence simultaneously acts as a query. For $n$ queries stacked into $Q \in \mathbb{R}^{n \times d_k}$, we compute all attention outputs in parallel:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Here $QK^\top \in \mathbb{R}^{n \times n}$ is the attention score matrix — every query against every key. Softmax is applied row-wise, producing $A \in \mathbb{R}^{n \times n}$ where each row is a valid distribution. Multiplying by $V$ produces the output in $\mathbb{R}^{n \times d_v}$.

The $n \times n$ score matrix is why naive attention is $O(n^2 d_k)$ in compute and $O(n^2)$ in memory — the quadratic bottleneck that has driven an enormous body of follow-up work.

Self-Attention: A Sequence Attending to Itself

In self-attention, the queries, keys, and values all derive from the same input sequence $X \in \mathbb{R}^{n \times d}$ via learned linear projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned weight matrices. Each position projects its representation into three roles — what it is looking for (Q), what it offers (K), and what it contributes (V) — independently. This factorisation is what makes attention so expressive: a single token can simultaneously be a strong key for some queries while being a weak key for others.

Cross-attention is the variant where Q comes from one sequence and K, V from another. It is used in encoder-decoder transformers (e.g., for machine translation) and in diffusion models where the image denoiser attends to a text prompt.
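A minimal sketch of cross-attention (the function name and shapes are illustrative, not from any particular library):

import numpy as np

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Decoder positions (queries) attend to encoder positions (keys/values)."""
    Q = X_dec @ W_Q                 # queries come from the decoder sequence
    K = X_enc @ W_K                 # keys and values come from the encoder sequence
    V = X_enc @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                    # (n_dec, d_v): one blended vector per decoder position

rng = np.random.default_rng(1)
X_dec = rng.standard_normal((4, 8))     # e.g. 4 target-language tokens
X_enc = rng.standard_normal((6, 8))     # e.g. 6 source-language tokens
W = lambda: rng.standard_normal((8, 8))
print(cross_attention(X_dec, X_enc, W(), W(), W()).shape)   # (4, 8)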

Visualising Attention

The attention matrix $A$ is directly interpretable. Each row is a distribution over keys, telling us where a given query position directed its focus. Below is an example from a single attention head on the sentence "The cat sat on the mat":

Attention Weight Matrix — “The cat sat on the mat”

Queries ↓ / Keys →    The     cat     sat     on      the     mat
The                   0.55    0.15    0.10    0.08    0.07    0.05
cat                   0.22    0.42    0.18    0.05    0.08    0.05
sat                   0.10    0.30    0.32    0.12    0.04    0.12
on                    0.08    0.06    0.14    0.48    0.10    0.14
the                   0.28    0.04    0.04    0.12    0.38    0.14
mat                   0.08    0.12    0.10    0.14    0.18    0.38
Each row shows how much one token attends to every other token. Diagonal dominance indicates local self-attention; off-diagonal peaks reveal long-range dependencies. In this example, "cat" strongly attends to itself and "sat" (subject-verb binding), while "the" attends to "mat" (determiner-noun agreement).

Real transformer heads specialise: some track syntactic head-dependency relations, others track coreference, others attend to the previous token almost exclusively. Multi-head attention is the mechanism that allows all of this to happen simultaneously.

Multi-Head Attention

A single attention head computes one weighted combination. Multi-head attention runs h heads in parallel, each with its own projection matrices:

$$\text{head}_i = \text{Attention}(Q W_Q^{(i)},\; K W_K^{(i)},\; V W_V^{(i)})$$
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

where $W_O \in \mathbb{R}^{h d_v \times d}$ is an output projection. Each head operates in a subspace of dimension $d_k = d / h$, keeping total compute roughly equal to a single head at full dimension. The concatenated outputs are projected back to the model dimension $d$.

From the kernel perspective, multi-head attention approximates a richer kernel function: while a single head computes one inner product, multiple heads each compute an inner product in a different subspace, together approximating a more complex similarity function over the input.

Causal Masking for Autoregressive Models

Language models generate tokens left-to-right. At position $i$, the model must not attend to positions $j > i$ (future tokens that have not yet been generated). This is enforced by adding a mask $M \in \{0, -\infty\}^{n \times n}$ before softmax:

$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$

where $M_{ij} = -\infty$ when $j > i$ and $0$ otherwise. Because $e^{-\infty} = 0$, softmax completely zeros out masked positions. In practice $-10^9$ is used instead of true $-\infty$ for numerical stability. The resulting attention pattern is lower-triangular.

Implementation

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(
    Q: np.ndarray,   # (n_q, d_k)
    K: np.ndarray,   # (n_k, d_k)
    V: np.ndarray,   # (n_k, d_v)
    mask: np.ndarray | None = None,  # (n_q, n_k) additive mask
) -> np.ndarray:     # (n_q, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k)
    if mask is not None:
        scores = scores + mask               # −inf for masked positions
    weights = softmax(scores)                # (n_q, n_k), each row sums to 1
    return weights @ V                       # (n_q, d_v)

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular causal mask for autoregressive attention."""
    mask = np.triu(np.full((n, n), -1e9), k=1)
    return mask                              # 0 on/below diagonal, −1e9 above

def multi_head_attention(
    X: np.ndarray,                           # (n, d)
    W_Q: list[np.ndarray],                   # h × (d, d_k)
    W_K: list[np.ndarray],
    W_V: list[np.ndarray],
    W_O: np.ndarray,                         # (h*d_v, d)
    mask: np.ndarray | None = None,
) -> np.ndarray:
    heads = [
        attention(X @ wq, X @ wk, X @ wv, mask)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O
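A quick smoke test of the functions above, with random weights and arbitrary hyperparameters, just to confirm the shapes:

rng = np.random.default_rng(42)
n, d, h = 6, 16, 4                      # sequence length, model dim, heads
d_k = d // h

X   = rng.standard_normal((n, d))
W_Q = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, mask=causal_mask(n))
print(out.shape)                        # (6, 16): one d-dimensional vector per position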

Modern Developments

The vanilla attention equation has spawned a rich research programme. Here are the developments most worth knowing:

Flash Attention (Dao et al., 2022)

Standard attention materialises the full n×nn \times n score matrix in GPU HBM (high-bandwidth memory), causing repeated slow memory reads/writes. Flash Attention restructures the computation into tiled blocks that stay in the fast on-chip SRAM, achieving the same mathematical result while reducing memory I/O by up to 5-10×. This is a purely algorithmic improvement — no approximation — and is now the default attention implementation in most frameworks.
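The enabling observation is that softmax can be computed incrementally over blocks of keys, carrying only a running maximum, a running denominator, and a running weighted sum. The sketch below shows that online-softmax idea for a single query; it is a simplified illustration of the principle, not the actual fused CUDA kernel:

import numpy as np

def online_softmax_attention(q, K, V, block=128):
    """Single-query attention computed block-by-block over the keys."""
    d_k = K.shape[-1]
    m = -np.inf                         # running maximum of the scores seen so far
    denom = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[-1])         # running unnormalised weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d_k)                 # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale old statistics to the new max
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / denom                            # matches full softmax attention exactly

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((1000, 64))
V = rng.standard_normal((1000, 32))
s = K @ q / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V     # naive full-softmax reference
print(np.allclose(online_softmax_attention(q, K, V), ref))      # True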

Grouped-Query Attention (Ainslie et al., 2023)

In standard multi-head attention, each of the h heads has its own K and V projections. Grouped-query attention (GQA) groups query heads so that gg query heads share one pair of KV projections, reducing the KV cache size by a factor of gg at inference time with minimal quality loss. LLaMA 2 70B and Mistral-7B both use GQA.
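A shape-level sketch of the grouping (head counts and dimensions are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d_k = 10, 16
h_q, h_kv = 8, 2                             # 8 query heads share 2 KV heads (groups of 4)

Q = rng.standard_normal((h_q, n, d_k))       # one query tensor per query head
K = rng.standard_normal((h_kv, n, d_k))      # only h_kv key/value tensors are stored
V = rng.standard_normal((h_kv, n, d_k))

outputs = []
for i in range(h_q):
    g = i // (h_q // h_kv)                   # index of the KV head this query head shares
    scores = Q[i] @ K[g].T / np.sqrt(d_k)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    outputs.append(A @ V[g])
out = np.stack(outputs)                      # (8, 10, 16); the KV cache is 4x smaller
print(out.shape)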

Rotary Position Embeddings (Su et al., 2021)

Standard transformers inject position information via additive embeddings. Rotary Position Embeddings (RoPE) instead rotate query and key vectors by a position-dependent angle before computing the dot product. The dot product then naturally encodes the relative distance between positions:

$$(R_m \mathbf{q}) \cdot (R_n \mathbf{k}) = \mathbf{q}^\top R_m^\top R_n\, \mathbf{k} = \mathbf{q}^\top R_{n-m}\, \mathbf{k}$$

This gives transformers strong length generalisation and is used in LLaMA, Mistral, Gemma, and most modern open models.
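A compact sketch of the rotation for a single vector, following the standard formulation with base 10000; the helper name and the toy check at the end are illustrative:

import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)    # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                         # even and odd dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.ones(8), np.ones(8)
# The dot product of rotated vectors depends only on the relative offset n - m:
print(np.dot(rope(q, 3), rope(k, 7)))     # offset 4
print(np.dot(rope(q, 13), rope(k, 17)))   # offset 4 again: same value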

Linear Attention and State Space Models

The quadratic $O(n^2)$ cost is fundamental to softmax attention. Linear attention (Katharopoulos et al., 2020) replaces softmax with a kernel decomposition $\phi(\mathbf{q}) \cdot \phi(\mathbf{k})$, enabling $O(n)$ attention via the associativity of matrix multiplication. State space models such as Mamba (Gu & Dao, 2023) take a different route, replacing attention with a selective recurrence that achieves linear inference complexity while matching transformer quality on many tasks. Whether these will displace attention for long-context tasks is an open research question.
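A sketch of the reordering trick with the φ(x) = elu(x) + 1 feature map used by Katharopoulos et al.; the toy shapes are arbitrary:

import numpy as np

def phi(x):
    """Positive feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                   # (n, d_k) each
    KV = Kf.T @ V                             # (d_k, d_v): computed once
    Z = Kf.sum(axis=0)                        # (d_k,): normaliser
    return (Qf @ KV) / (Qf @ Z)[:, None]      # O(n d_k d_v) total; no n x n matrix

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 512, 64))
print(linear_attention(Q, K, V).shape)        # (512, 64)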

KV Cache

During autoregressive inference, at each new token generation step the K and V matrices are recomputed for all previous positions — wasteful, since they have not changed. The KV cache stores computed K and V tensors across generation steps, reducing per-step compute from $O(n^2)$ to $O(n)$. The trade-off is memory: at batch size 1 and 32K context, a 70B model's KV cache can exceed 32 GB. GQA, quantised KV caches, and page-based memory management (vLLM's PagedAttention) all address this bottleneck.
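A minimal sketch of one decode step with a cache (the projection matrices and random inputs are stand-ins for a real model):

import numpy as np

def decode_step(x_new, W_Q, W_K, W_V, cache_K, cache_V):
    """One autoregressive step: project only the new token, reuse the cached K and V."""
    q = x_new @ W_Q                                  # query for the new position
    cache_K = np.vstack([cache_K, x_new @ W_K])      # append one new key row
    cache_V = np.vstack([cache_V, x_new @ W_V])      # append one new value row
    s = cache_K @ q / np.sqrt(cache_K.shape[-1])     # attend over all positions so far
    a = np.exp(s - s.max())
    a = a / a.sum()
    return a @ cache_V, cache_K, cache_V

rng = np.random.default_rng(0)
d, d_k = 16, 8
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))
cache_K, cache_V = np.empty((0, d_k)), np.empty((0, d_k))
for _ in range(5):                                   # "generate" 5 tokens
    out, cache_K, cache_V = decode_step(rng.standard_normal(d), W_Q, W_K, W_V, cache_K, cache_V)
print(cache_K.shape)                                 # (5, 8): one cached key per generated token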

Complexity Summary

Operation             Compute         Memory
─────────────────────────────────────────────────────
QKᵀ  (score matrix)   O(n² dₖ)        O(n²)
softmax               O(n²)           O(n²)
AV   (output)         O(n² dᵥ)        O(n dᵥ)
─────────────────────────────────────────────────────
Total (attention)     O(n² d)         O(n²)
Flash Attention       O(n² d)         O(n)   ← no materialisation
Linear Attention      O(n d²)         O(d²)  ← d >> n is problematic

The quadratic term in n is the reason context windows above a few thousand tokens were historically expensive. Flash Attention solves the memory bottleneck (the O(n2)O(n^2) HBM requirement) without solving the compute one. Extending context to 128K+ tokens (as in Claude 3, GPT-4 Turbo) requires a combination of Flash Attention, hardware with high FLOP/byte ratios, and architectural tricks like sliding-window attention in lower layers.

Putting It Together

Let's retrace the derivation:

  1. Represent words as vectors in $\mathbb{R}^{d_k}$ so that geometric similarity encodes semantic similarity.
  2. Compute query-key similarity via the dot product $s_i = \mathbf{q} \cdot \mathbf{k}_i$.
  3. Scale by $1/\sqrt{d_k}$ to control variance and prevent softmax saturation.
  4. Apply softmax to convert scores into a probability distribution $\boldsymbol{\alpha}$.
  5. Return the weighted sum of value vectors: $\sum_i \alpha_i \mathbf{v}_i$.
  6. Generalise to matrices Q, K, V and you have the full equation.

$$\text{Attention}(Q, K, V) = \underbrace{\text{softmax}\!\left(\frac{\overbrace{QK^\top}^{\text{similarity scores}}}{\underbrace{\sqrt{d_k}}_{\text{variance control}}}\right)}_{\text{attention weights}} \underbrace{V}_{\text{values}}$$

Attention is, at its core, a differentiable, parallelisable, soft nearest-neighbour lookup over learned representations. Its power comes not from any single clever idea but from the composition: dot-product similarity gives it expressiveness, softmax gives it interpretability, the scaling gives it training stability, and the learned projections Q, K, V give it adaptability to any task.

That is the sixth sense — a direct, learned ability to reach across a sequence and pull in exactly the context needed, at every position, simultaneously.

Further reading
Vaswani et al. (2017) — Attention Is All You Need
Bahdanau et al. (2014) — Neural Machine Translation by Jointly Learning to Align and Translate
Dao et al. (2022) — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Su et al. (2021) — RoFormer: Enhanced Transformer with Rotary Position Embedding
Gu & Dao (2023) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces