The transformer architecture, introduced by Vaswani et al. in Attention Is All You Need (2017), displaced recurrent networks as the dominant paradigm in sequence modelling. Its defining contribution was not any single idea, but a particular composition of ideas crystallised into one equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

This post builds that equation from first principles. We start with the abstract problem of soft information retrieval, introduce the geometric language of word vectors, derive each component in sequence, and close with a survey of the modern extensions that have made attention even more powerful — and more efficient — since 2017.

Prerequisites. Comfortable with vectors, matrix multiplication, and the chain rule. No prior knowledge of transformers required.

A Brief History

Attention did not originate with the transformer. Bahdanau et al. (2014) introduced the first attention mechanism for neural machine translation, allowing a decoder RNN to selectively focus on parts of an encoded source sequence rather than compressing everything into a single fixed-size vector. Luong et al. (2015) proposed several simplified variants. What Vaswani et al. contributed was the insight that you could remove the recurrence entirely and build a model composed purely of attention and feed-forward layers — the Transformer — that parallelises across the full sequence length and scales dramatically better with data and compute.

Today, scaled dot-product attention is the computational primitive underlying GPT-4, Claude, Gemini, LLaMA, and virtually every frontier model. Understanding it is not optional for anyone working in ML.

The Core Problem: Key-Value Retrieval

A key-value (KV) lookup takes three inputs — a set of $n$ keys $k_1, \ldots, k_n$, a corresponding set of $n$ values $v_1, \ldots, v_n$, and a query $q$ — and returns some value based on how well the query matches each key.

The familiar dictionary is the discrete version: the query must exactly match a key, and you get back exactly one value.

d = {"apple": 10, "banana": 5, "chair": 2}
d["apple"]  # → 10  (exact match)

The problem: what if the query is "fruit"? There is no exact match. Both "apple" and "banana" are partial matches; "chair" is irrelevant. A discrete lookup is forced to either fail or pick one arbitrarily.

Soft Retrieval

Instead of choosing, we blend. Assign a weight $\alpha_i \in [0,1]$ to each key measuring how relevant it is to the query, then return the weighted sum of values:

$$\text{output} = \sum_{i=1}^{n} \alpha_i\, v_i$$

where $\alpha_i \geq 0$ and $\sum_i \alpha_i = 1$. These weights are the attention scores. For the "fruit" query, we might set $\alpha_{\text{apple}} = 0.6$, $\alpha_{\text{banana}} = 0.4$, $\alpha_{\text{chair}} = 0.0$, giving output $0.6 \times 10 + 0.4 \times 5 + 0 \times 2 = 8$.
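As a minimal sketch, here is that soft lookup in numpy, with the weights simply set by hand (the rest of this post is about how to compute them):

import numpy as np

# Hand-picked relevance weights for the query "fruit"; in a real model these
# are computed from the query and keys, as derived below.
values  = np.array([10, 5, 2])           # apple, banana, chair
weights = np.array([0.6, 0.4, 0.0])      # non-negative, sums to 1

output = weights @ values                # 0.6*10 + 0.4*5 + 0.0*2
print(output)                            # 8.0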

The question is: where do the attention scores come from?

Word Vectors and the Geometry of Meaning

Represent each word as a real-valued vector $\mathbf{x} \in \mathbb{R}^{d_k}$. Ideally, vectors whose words are semantically similar should be geometrically close.

[Figure: Word Embedding Space. A 2-D embedding (axes d₁, d₂) showing fruit (apple, banana, mango), vegetable (carrot, broccoli, spinach), and furniture (chair, table, desk) clusters, with the query "fruit".]
Words with similar meanings cluster together. The query "fruit" sits between the fruit cluster (high attention) and away from furniture (near-zero attention). Attention is a soft, differentiable version of nearest-neighbour retrieval.

These representations are not hand-crafted. In practice they emerge from self-supervised objectives: Word2Vec (Mikolov et al., 2013) trained embeddings by predicting surrounding words; modern transformers learn contextualised representations end-to-end as part of the full model. The key insight is that geometry encodes semantics — linear structure holds approximately:

$$\mathbf{v}_{\text{queen}} - \mathbf{v}_{\text{woman}} + \mathbf{v}_{\text{man}} \approx \mathbf{v}_{\text{king}}$$

Similarity via Dot Product

Given vectors for our query and each key, we need a scalar similarity measure. The dot product is the canonical choice:

$$\mathbf{q} \cdot \mathbf{k}_i = \sum_{j=1}^{d_k} q_j k_{ij} = \|\mathbf{q}\|\,\|\mathbf{k}_i\|\cos\theta_{qi}$$

Its sign encodes alignment: positive when vectors point in the same direction (similar), zero when orthogonal (unrelated), negative when opposing (dissimilar). Unlike cosine similarity, the dot product preserves magnitude, which carries useful information in learned representations — a word vector with a larger norm may represent a more salient concept.

Why not cosine similarity? Cosine normalises out the magnitude, treating all vectors as lying on the unit hypersphere. In a deep learning setting the magnitude of an activation is a signal — it conveys confidence or salience. Discarding it is an inductive bias we generally do not want to impose. With proper regularisation (LayerNorm, weight decay), the dot product behaves well without the normalisation step.

Computing the dot product between a single query and all $n$ keys at once is a matrix-vector product. Stacking the key vectors row-wise into a matrix $K \in \mathbb{R}^{n \times d_k}$:

$$\mathbf{s} = K\mathbf{q}^\top \in \mathbb{R}^n, \qquad s_i = \mathbf{q} \cdot \mathbf{k}_i$$
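A quick numerical sketch, with toy 2-D vectors chosen purely for illustration:

import numpy as np

K = np.array([[ 1.0, 0.2],    # key for "apple"
              [ 0.9, 0.3],    # key for "banana"
              [-0.8, 0.5]])   # key for "chair"
q = np.array([1.0, 0.1])      # query "fruit"

s = K @ q                     # one dot product per key
print(s)                      # [ 1.02  0.93 -0.75]: fruit-like keys score highest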

From Similarity Scores to Probabilities: Softmax

The raw scores $s_i$ live in $(-\infty, +\infty)$. We need to convert them to a valid probability distribution over the keys. The softmax function does exactly this:

$$\text{softmax}(\mathbf{s})_i = \frac{e^{s_i}}{\sum_{j=1}^n e^{s_j}}$$
[Figure: Raw Dot Products → Attention Weights. Raw scores q · kᵢ of 4.2 (apple), 3.8 (banana), 0.6 (chair) and -0.9 (desk) map under softmax to attention weights 0.59, 0.39, 0.02 and 0.00: each weight lies between 0 and 1, and the weights sum to 1.]
Softmax maps arbitrary real-valued scores to a probability simplex. The relative ordering is preserved (apple still wins), but every score is now a valid probability. Negative raw scores become small but non-zero weights.

Softmax is monotone — the ordering of scores is preserved. It is also differentiable everywhere, allowing gradients to flow back into the query and key vectors during training. Crucially, the exponential amplifies differences: a score that is slightly higher than its neighbours receives disproportionately more weight, giving attention a natural tendency towards sparsity.
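To make the mapping concrete, here is the figure's example reproduced in a few lines of numpy (the raw scores are the illustrative ones from the figure):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())                 # subtract max for numerical stability
    return e / e.sum()

scores = np.array([4.2, 3.8, 0.6, -0.9])    # apple, banana, chair, desk
print(softmax(scores).round(2))             # [0.59 0.39 0.02 0.  ]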

Entropy and Sharpness

The Shannon entropy of the attention distribution quantifies how spread out the focus is:

$$H(\boldsymbol{\alpha}) = -\sum_{i=1}^n \alpha_i \log_2 \alpha_i$$

Low entropy (near 0 bits) means the model is attending sharply to one key. Maximum entropy ($\log_2 n$ bits) means uniform attention — the model does not know which key to focus on. During training, you can observe the average attention entropy as a proxy for how decisive the model has become.
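A small helper makes this easy to monitor (taking 0 · log 0 = 0, as is standard):

import numpy as np

def attention_entropy(alpha: np.ndarray) -> float:
    """Shannon entropy (in bits) of one attention distribution."""
    alpha = alpha[alpha > 0]                  # treat 0 * log(0) as 0
    return float(-(alpha * np.log2(alpha)).sum())

print(attention_entropy(np.array([0.59, 0.39, 0.02, 0.0])))   # ~1.09 bits, fairly sharp
print(attention_entropy(np.full(4, 0.25)))                    # 2.0 bits, uniform over 4 keys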

The Scaling Factor: Why 1/√dₖ?

Suppose query and key components are i.i.d. random variables with mean 0 and variance 1. The dot product $\mathbf{q} \cdot \mathbf{k} = \sum_{j=1}^{d_k} q_j k_j$ is a sum of $d_k$ such terms. Because the terms are independent, their means and variances add (and, by the central limit theorem, the dot product is approximately Gaussian):

$$\mathbb{E}[\mathbf{q}\cdot\mathbf{k}] = 0, \qquad \mathrm{Var}[\mathbf{q}\cdot\mathbf{k}] = d_k$$

As $d_k$ grows, the standard deviation of the dot products grows as $\sqrt{d_k}$, pushing softmax into saturation — one score dominates and all gradients vanish. Dividing by $\sqrt{d_k}$ restores unit variance:

$$\mathrm{Var}\!\left[\frac{\mathbf{q}\cdot\mathbf{k}}{\sqrt{d_k}}\right] = 1$$
[Figure: Effect of the 1/√dₖ scaling factor (dₖ = 64). Without scaling, softmax([8, 7, 3, 1]) gives weights 0.73, 0.27, 0.00, 0.00 (entropy 0.89 bits); with scaling, softmax([8, 7, 3, 1] / √64) gives 0.35, 0.31, 0.19, 0.15 (entropy 1.92 bits).]
Without scaling, dot products in high-dimensional spaces tend to have large magnitudes, making softmax near-deterministic (all weight on one key). This saturates the softmax and causes gradients to vanish. Dividing by √dₖ keeps the pre-softmax variance at ~1, regardless of dimension.
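You can check the variance argument empirically with a quick simulation (random Gaussian vectors, nothing model-specific):

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal((10_000, d_k))      # 10k random queries, unit-variance components
k = rng.standard_normal((10_000, d_k))      # 10k random keys

dots = (q * k).sum(axis=-1)                 # one dot product per pair
print(dots.var())                           # ~64: grows with d_k
print((dots / np.sqrt(d_k)).var())          # ~1: restored by the scaling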

The attention weight for the $i$-th key is therefore:

$$\alpha_i = \text{softmax}\!\left(\frac{\mathbf{s}}{\sqrt{d_k}}\right)_i = \frac{\exp(\mathbf{q}\cdot\mathbf{k}_i/\sqrt{d_k})}{\sum_{j=1}^n \exp(\mathbf{q}\cdot\mathbf{k}_j/\sqrt{d_k})}$$

Values as Vectors

So far, values have been scalars. In practice each value is a vector $\mathbf{v}_i \in \mathbb{R}^{d_v}$, stacked row-wise into $V \in \mathbb{R}^{n \times d_v}$. The output of attention is then a weighted sum of value vectors:

$$\text{out} = \sum_{i=1}^n \alpha_i\, \mathbf{v}_i = \boldsymbol{\alpha}^\top V \in \mathbb{R}^{d_v}$$

The output is a vector in the value space — a blend of all value vectors weighted by how relevant their keys were to the query. This is the core computation.
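Here is the full single-query pipeline so far in one short sketch, again with toy numbers:

import numpy as np

def single_query_attention(q, K, V):
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)             # similarity of q to every key, scaled
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()               # softmax -> attention weights
    return alpha @ V                          # weighted blend of value vectors

q = np.array([1.0, 0.1])
K = np.array([[1.0, 0.2], [0.9, 0.3], [-0.8, 0.5]])
V = np.array([[10., 0.], [5., 1.], [2., 7.]])       # one value vector per key
print(single_query_attention(q, K, V))              # a blend dominated by the fruit values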

Batching Over Queries: Matrix Form

In a transformer, every position in the sequence simultaneously acts as a query. For $n$ queries stacked into $Q \in \mathbb{R}^{n \times d_k}$, we compute all attention outputs in parallel:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Here $QK^\top \in \mathbb{R}^{n \times n}$ is the attention score matrix — every query against every key. Softmax is applied row-wise, producing $A \in \mathbb{R}^{n \times n}$ where each row is a valid distribution. Multiplying by $V$ produces the output in $\mathbb{R}^{n \times d_v}$.

The $n \times n$ score matrix is why naive attention is $O(n^2 d_k)$ in compute and $O(n^2)$ in memory — the quadratic bottleneck that has driven an enormous body of follow-up work.

Self-Attention: A Sequence Attending to Itself

In self-attention, the queries, keys, and values all derive from the same input sequence $X \in \mathbb{R}^{n \times d}$ via learned linear projections:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned weight matrices. Each position projects its representation into three roles — what it is looking for (Q), what it offers (K), and what it contributes (V) — independently. This factorisation is what makes attention so expressive: a single token can simultaneously be a strong key for some queries while being a weak key for others.

Cross-attention is the variant where Q comes from one sequence and K, V from another. It is used in encoder-decoder transformers (e.g., for machine translation) and in diffusion models where the image denoiser attends to a text prompt.
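A minimal sketch of cross-attention (the function name and shapes are illustrative, not from any particular library):

import numpy as np

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Decoder positions (queries) attend to encoder positions (keys/values)."""
    Q = X_dec @ W_Q                 # queries come from the decoder sequence
    K = X_enc @ W_K                 # keys and values come from the encoder sequence
    V = X_enc @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ V                    # (n_dec, d_v): one blended vector per decoder position

rng = np.random.default_rng(1)
X_dec = rng.standard_normal((4, 8))     # e.g. 4 target-language tokens
X_enc = rng.standard_normal((6, 8))     # e.g. 6 source-language tokens
W = lambda: rng.standard_normal((8, 8))
print(cross_attention(X_dec, X_enc, W(), W(), W()).shape)   # (4, 8)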

Visualising Attention

The attention matrix $A$ is directly interpretable. Each row is a distribution over keys, telling us where a given query position directed its focus. Below is an example from a single attention head on the sentence "The cat sat on the mat":

Attention Weight Matrix — “The cat sat on the mat”

Queries ↓ / Keys →    The     cat     sat     on      the     mat
The                   0.55    0.15    0.10    0.08    0.07    0.05
cat                   0.22    0.42    0.18    0.05    0.08    0.05
sat                   0.10    0.30    0.32    0.12    0.04    0.12
on                    0.08    0.06    0.14    0.48    0.10    0.14
the                   0.28    0.04    0.04    0.12    0.38    0.14
mat                   0.08    0.12    0.10    0.14    0.18    0.38
Each row shows how much one token attends to every other token. Diagonal dominance indicates local self-attention; off-diagonal peaks reveal long-range dependencies. In this example, "cat" strongly attends to itself and "sat" (subject-verb binding), while "the" attends to "mat" (determiner-noun agreement).

Real transformer heads specialise: some track syntactic head-dependency relations, others track coreference, others attend to the previous token almost exclusively. Multi-head attention is the mechanism that allows all of this to happen simultaneously.

Multi-Head Attention

A single attention head computes one weighted combination. Multi-head attention runs h heads in parallel, each with its own projection matrices:

$$\text{head}_i = \text{Attention}(Q W_Q^{(i)},\; K W_K^{(i)},\; V W_V^{(i)})$$
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O$$

where $W_O \in \mathbb{R}^{h d_v \times d}$ is an output projection. Each head operates in a subspace of dimension $d_k = d / h$, keeping total compute roughly equal to a single head at full dimension. The concatenated outputs are projected back to the model dimension $d$.

From the kernel perspective, multi-head attention approximates a richer kernel function: while a single head computes one inner product, multiple heads each compute an inner product in a different subspace, together approximating a more complex similarity function over the input.

Causal Masking for Autoregressive Models

Language models generate tokens left-to-right. At position $i$, the model must not attend to positions $j > i$ (future tokens that have not yet been generated). This is enforced by adding a mask $M \in \{0, -\infty\}^{n \times n}$ before softmax:

$$A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$$

where $M_{ij} = -\infty$ when $j > i$ and $0$ otherwise. Because $e^{-\infty} = 0$, softmax completely zeros out masked positions. In practice $-10^9$ is used instead of true $-\infty$ for numerical stability. The resulting attention pattern is lower-triangular.

Implementation

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(
    Q: np.ndarray,   # (n_q, d_k)
    K: np.ndarray,   # (n_k, d_k)
    V: np.ndarray,   # (n_k, d_v)
    mask: np.ndarray | None = None,  # (n_q, n_k) additive mask
) -> np.ndarray:     # (n_q, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n_q, n_k)
    if mask is not None:
        scores = scores + mask               # −inf for masked positions
    weights = softmax(scores)                # (n_q, n_k), each row sums to 1
    return weights @ V                       # (n_q, d_v)

def causal_mask(n: int) -> np.ndarray:
    """Lower-triangular causal mask for autoregressive attention."""
    mask = np.triu(np.full((n, n), -1e9), k=1)
    return mask                              # 0 on/below diagonal, −1e9 above

def multi_head_attention(
    X: np.ndarray,                           # (n, d)
    W_Q: list[np.ndarray],                   # h × (d, d_k)
    W_K: list[np.ndarray],
    W_V: list[np.ndarray],
    W_O: np.ndarray,                         # (h*d_v, d)
    mask: np.ndarray | None = None,
) -> np.ndarray:
    heads = [
        attention(X @ wq, X @ wk, X @ wv, mask)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O
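A quick smoke test of the functions above, with random weights and arbitrary hyperparameters, just to confirm the shapes:

rng = np.random.default_rng(42)
n, d, h = 6, 16, 4                      # sequence length, model dim, heads
d_k = d // h

X   = rng.standard_normal((n, d))
W_Q = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, mask=causal_mask(n))
print(out.shape)                        # (6, 16): one d-dimensional vector per position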

Modern Developments

The vanilla attention equation has spawned a rich research programme. Here are the developments most worth knowing:

Flash Attention (Dao et al., 2022)

Standard attention materialises the full n×nn \times n score matrix in GPU HBM (high-bandwidth memory), causing repeated slow memory reads/writes. Flash Attention restructures the computation into tiled blocks that stay in the fast on-chip SRAM, achieving the same mathematical result while reducing memory I/O by up to 5-10×. This is a purely algorithmic improvement — no approximation — and is now the default attention implementation in most frameworks.
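The enabling observation is that softmax can be computed incrementally over blocks of keys, carrying only a running maximum, a running denominator, and a running weighted sum. The sketch below shows that online-softmax idea for a single query; it is a simplified illustration of the principle, not the actual fused CUDA kernel:

import numpy as np

def online_softmax_attention(q, K, V, block=128):
    """Single-query attention computed block-by-block over the keys."""
    d_k = K.shape[-1]
    m = -np.inf                         # running maximum of the scores seen so far
    denom = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[-1])         # running unnormalised weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d_k)                 # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale old statistics to the new max
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / denom                            # matches full softmax attention exactly

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((1000, 64))
V = rng.standard_normal((1000, 32))
s = K @ q / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V     # naive full-softmax reference
print(np.allclose(online_softmax_attention(q, K, V), ref))      # True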

Grouped-Query Attention (Ainslie et al., 2023)

In standard multi-head attention, each of the h heads has its own K and V projections. Grouped-query attention (GQA) groups query heads so that gg query heads share one pair of KV projections, reducing the KV cache size by a factor of gg at inference time with minimal quality loss. LLaMA 2 70B and Mistral-7B both use GQA.
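A shape-level sketch of the grouping (head counts and dimensions are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d_k = 10, 16
h_q, h_kv = 8, 2                             # 8 query heads share 2 KV heads (groups of 4)

Q = rng.standard_normal((h_q, n, d_k))       # one query tensor per query head
K = rng.standard_normal((h_kv, n, d_k))      # only h_kv key/value tensors are stored
V = rng.standard_normal((h_kv, n, d_k))

outputs = []
for i in range(h_q):
    g = i // (h_q // h_kv)                   # index of the KV head this query head shares
    scores = Q[i] @ K[g].T / np.sqrt(d_k)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    outputs.append(A @ V[g])
out = np.stack(outputs)                      # (8, 10, 16); the KV cache is 4x smaller
print(out.shape)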

Rotary Position Embeddings (Su et al., 2021)

Standard transformers inject position information via additive embeddings. Rotary Position Embeddings (RoPE) instead rotate query and key vectors by a position-dependent angle before computing the dot product. The dot product then naturally encodes the relative distance between positions:

$$(R_m \mathbf{q}) \cdot (R_n \mathbf{k}) = \mathbf{q}^\top R_m^\top R_n\, \mathbf{k} = \mathbf{q}^\top R_{n-m}\, \mathbf{k}$$

This gives transformers strong length generalisation and is used in LLaMA, Mistral, Gemma, and most modern open models.
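A compact sketch of the rotation for a single vector, following the standard formulation with base 10000; the helper name and the toy check at the end are illustrative:

import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)    # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                         # even and odd dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q, k = np.ones(8), np.ones(8)
# The dot product of rotated vectors depends only on the relative offset n - m:
print(np.dot(rope(q, 3), rope(k, 7)))     # offset 4
print(np.dot(rope(q, 13), rope(k, 17)))   # offset 4 again: same value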

Linear Attention and State Space Models

The quadratic $O(n^2)$ cost is fundamental to softmax attention. Linear attention (Katharopoulos et al., 2020) replaces softmax with a kernel decomposition $\phi(\mathbf{q}) \cdot \phi(\mathbf{k})$, enabling $O(n)$ attention via the associativity of matrix multiplication. State space models such as Mamba (Gu & Dao, 2023) take a different route, replacing attention with a selective recurrence that achieves linear inference complexity while matching transformer quality on many tasks. Whether these will displace attention for long-context tasks is an open research question.
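A sketch of the reordering trick with the φ(x) = elu(x) + 1 feature map used by Katharopoulos et al.; the toy shapes are arbitrary:

import numpy as np

def phi(x):
    """Positive feature map: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                   # (n, d_k) each
    KV = Kf.T @ V                             # (d_k, d_v): computed once
    Z = Kf.sum(axis=0)                        # (d_k,): normaliser
    return (Qf @ KV) / (Qf @ Z)[:, None]      # O(n d_k d_v) total; no n x n matrix

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 512, 64))
print(linear_attention(Q, K, V).shape)        # (512, 64)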

KV Cache

During autoregressive inference, at each new token generation step the K and V matrices are recomputed for all previous positions — wasteful, since they have not changed. The KV cache stores computed K and V tensors across generation steps, reducing per-step compute from $O(n^2)$ to $O(n)$. The trade-off is memory: at batch size 1 and 32K context, a 70B model's KV cache can exceed 32 GB. GQA, quantised KV caches, and page-based memory management (vLLM's PagedAttention) all address this bottleneck.
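A minimal sketch of one decode step with a cache (the projection matrices and random inputs are stand-ins for a real model):

import numpy as np

def decode_step(x_new, W_Q, W_K, W_V, cache_K, cache_V):
    """One autoregressive step: project only the new token, reuse the cached K and V."""
    q = x_new @ W_Q                                  # query for the new position
    cache_K = np.vstack([cache_K, x_new @ W_K])      # append one new key row
    cache_V = np.vstack([cache_V, x_new @ W_V])      # append one new value row
    s = cache_K @ q / np.sqrt(cache_K.shape[-1])     # attend over all positions so far
    a = np.exp(s - s.max())
    a = a / a.sum()
    return a @ cache_V, cache_K, cache_V

rng = np.random.default_rng(0)
d, d_k = 16, 8
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))
cache_K, cache_V = np.empty((0, d_k)), np.empty((0, d_k))
for _ in range(5):                                   # "generate" 5 tokens
    out, cache_K, cache_V = decode_step(rng.standard_normal(d), W_Q, W_K, W_V, cache_K, cache_V)
print(cache_K.shape)                                 # (5, 8): one cached key per generated token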

Complexity Summary

Operation             Compute         Memory
─────────────────────────────────────────────────────
QKᵀ  (score matrix)   O(n² dₖ)        O(n²)
softmax               O(n²)           O(n²)
AV   (output)         O(n² dᵥ)        O(n dᵥ)
─────────────────────────────────────────────────────
Total (attention)     O(n² d)         O(n²)
Flash Attention       O(n² d)         O(n)   ← no materialisation
Linear Attention      O(n d²)         O(d²)  ← d >> n is problematic

The quadratic term in n is the reason context windows above a few thousand tokens were historically expensive. Flash Attention solves the memory bottleneck (the O(n2)O(n^2) HBM requirement) without solving the compute one. Extending context to 128K+ tokens (as in Claude 3, GPT-4 Turbo) requires a combination of Flash Attention, hardware with high FLOP/byte ratios, and architectural tricks like sliding-window attention in lower layers.

Putting It Together

Let's retrace the derivation:

  1. Represent words as vectors in $\mathbb{R}^{d_k}$ so that geometric similarity encodes semantic similarity.
  2. Compute query-key similarity via the dot product $s_i = \mathbf{q} \cdot \mathbf{k}_i$.
  3. Scale by $1/\sqrt{d_k}$ to control variance and prevent softmax saturation.
  4. Apply softmax to convert scores into a probability distribution $\boldsymbol{\alpha}$.
  5. Return the weighted sum of value vectors: $\sum_i \alpha_i \mathbf{v}_i$.
  6. Generalise to matrices Q, K, V and you have the full equation.

$$\text{Attention}(Q, K, V) = \underbrace{\text{softmax}\!\left(\frac{\overbrace{QK^\top}^{\text{similarity scores}}}{\underbrace{\sqrt{d_k}}_{\text{variance control}}}\right)}_{\text{attention weights}} \underbrace{V}_{\text{values}}$$

Attention is, at its core, a differentiable, parallelisable, soft nearest-neighbour lookup over learned representations. Its power comes not from any single clever idea but from the composition: dot-product similarity gives it expressiveness, softmax gives it interpretability, the scaling gives it training stability, and the learned projections Q, K, V give it adaptability to any task.

That is the sixth sense — a direct, learned ability to reach across a sequence and pull in exactly the context needed, at every position, simultaneously.

Further reading
Vaswani et al. (2017) — Attention Is All You Need
Bahdanau et al. (2014) — Neural Machine Translation by Jointly Learning to Align and Translate
Dao et al. (2022) — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Su et al. (2021) — RoFormer: Enhanced Transformer with Rotary Position Embedding
Gu & Dao (2023) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces