Introduction
Large language models can summarize case law, write proofs, and debug distributed systems. But when they do these things, what are they actually computing? One enduring hypothesis is that transformers are, under the hood, performing something like Bayesian inference — maintaining a posterior distribution over latent hypotheses and updating it as tokens stream in. The intuition is appealing: a model reading a mystery novel ought to track the probability each character is the culprit, sharpening its beliefs with each new clue.
But “something like” is doing a lot of heavy lifting. Do transformers genuinely implement Bayesian reasoning, with the geometric structures that entails? Or do they merely approximate the input-output map of a Bayesian agent using some unrelated computational strategy — a giant lookup table, say, or a bag of heuristics?
A trilogy of papers by Agarwal, Dalal, and Misra (2025–2026) answers this question with unusual precision. Across three tightly linked studies, they establish a unified thesis: gradient descent, applied to the cross-entropy objective, sculpts the internal geometry of transformer attention into Bayesian manifolds, and these structures persist from toy-scale wind tunnels all the way to production-scale language models.
The trilogy proceeds in three acts:
- Paper I (The Bayesian Geometry of Transformer Attention, arXiv 2512.22471) — demonstrates that transformers match Bayes-optimal posteriors with sub-bit precision on controlled tasks, identifies three inference primitives, and discovers the three-stage internal mechanism.
- Paper II (Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds, arXiv 2512.22473) — derives the gradient-level mechanism by which training shapes attention into these structures, revealing an EM-like feedback loop between attention routing and value geometry.
- Paper III (Geometric Scaling of Bayesian Inference in LLMs, arXiv 2512.23752) — tests whether these geometric signatures survive at scale, probing Pythia-410M, Phi-2, Llama-3.2-1B, and Mistral-7B, finding that the core structures are universal.
This post walks through the trilogy's key arguments, formalisms, and empirical results. The goal is not to recapitulate every figure, but to give you a clear picture of what was proven, how it was proven, and why it matters for how we think about transformer reasoning.
The Methodology: Bayesian Wind Tunnels
The central methodological innovation is the Bayesian wind tunnel: a synthetic task engineered so that (a) the Bayes-optimal posterior is known exactly in closed form, (b) the hypothesis space is too large to memorize, and (c) the task genuinely requires inference. If a model matches the optimal posterior on such a task, it cannot be cheating — it must be computing.
Three conditions make a good wind tunnel:
- Known posterior. The ground-truth Bayesian posterior is available analytically, so we can measure the model's deviation in bits or nats rather than relying on vague “seems about right” metrics.
- Impossible memorization. The hypothesis space is combinatorially large — far exceeding the model's parameter count — ruling out lookup-table strategies.
- Genuine inference required. Each observation provides partial evidence that must be integrated sequentially. No shortcut exists.
The trilogy deploys four wind tunnels, each stress-testing a different facet of Bayesian computation:
- Bijection elimination — a vocabulary of $K$ symbols mapped through an unknown random bijection. The model must identify the bijection from input-output pairs. The hypothesis space contains all $K!$ candidate bijections — vastly larger than any model's parameter count. The Bayesian posterior eliminates inconsistent bijections with each observation, yielding a clean staircase entropy curve (a minimal sketch follows this list).
- HMM state tracking — a hidden Markov model with $N$ hidden states and $M$ observation symbols. The Bayes-optimal solution is the forward algorithm, which propagates beliefs through stochastic state transitions. This tests whether a model can perform belief transport — updating through dynamics, not just accumulating static evidence.
- Bayesian regression — a continuous task with a known Gaussian posterior, testing the model's ability to track uncertainty in a smooth, real-valued parameter space.
- Associative recall — a content-based memory task. The model sees a sequence of key-value bindings and must retrieve the correct value given a query key. This tests random-access binding — the ability to store and retrieve hypotheses by content, not by position.
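To make the staircase concrete, here is a minimal sketch of the bijection wind tunnel's Bayes-optimal bookkeeping. The vocabulary size $K = 5$ and the choice of ground-truth bijection are illustrative assumptions; the papers' task sizes are far larger:

```python
import math
from itertools import permutations

# Toy bijection wind tunnel: a K-symbol vocabulary, a hidden bijection,
# and the exact Bayes-optimal predictive entropy staircase.
K = 5
vocab = list(range(K))
hypotheses = list(permutations(vocab))   # all K! candidate bijections

true_bijection = hypotheses[17]          # hidden ground truth (arbitrary)
observations = []                        # (input, output) pairs seen so far

for k, x in enumerate(vocab[:-1]):
    # Hypotheses consistent with the evidence so far: exactly (K - k)!.
    consistent = [h for h in hypotheses
                  if all(h[i] == o for i, o in observations)]
    # Predictive entropy for a fresh query symbol: log2(K - k) bits.
    entropy = math.log2(K - k)
    print(f"pairs seen={k}  consistent={len(consistent)}  "
          f"entropy={entropy:.3f} bits")
    observations.append((x, true_bijection[x]))
```

Running it prints the staircase exactly: 120, 24, 6, 2 consistent hypotheses and 2.32, 2.00, 1.58, 1.00 bits of predictive entropy.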
Why wind tunnels? The analogy is aeronautical: you don't learn about lift by watching a 747 in flight. You strip the problem down to a wing cross-section in controlled airflow. Similarly, wind tunnels isolate inference primitives from the noise of natural-language training, letting you make exact quantitative comparisons against the Bayesian gold standard.
Three Inference Primitives
A key contribution of Paper I is a taxonomy of inference primitives — the atomic computational operations that any system performing Bayesian inference must be able to execute. The papers identify three:
1. Belief Accumulation
The most basic primitive: integrating evidence into a running posterior. Given a sequence of observations $x_1, \dots, x_t$, the model must maintain and update the posterior $P(h \mid x_{1:t})$ over latent hypotheses $h$. In the bijection task, each observation eliminates a subset of the possible bijections. The Bayes-optimal posterior takes the form:

$$P(h \mid x_{1:t}) \;\propto\; P(h) \prod_{i=1}^{t} P(x_i \mid h)$$

For the bijection task, this simplifies beautifully. After seeing $k$ distinct input-output pairs from a vocabulary of $K$ symbols, exactly $(K-k)!$ bijections remain consistent with the evidence, and the predictive entropy for a fresh query symbol follows a clean staircase:

$$H(k) \;=\; \log_2(K - k) \text{ bits}$$
Belief accumulation is “easy” in the sense that every architecture tested can do it to some degree — even LSTMs. The hard part is doing it with sub-bit precision, which turns out to require the full transformer.
2. Belief Transport
When the latent state evolves stochastically over time — as in a hidden Markov model — the model must not only incorporate new evidence but also propagate its beliefs through the transition dynamics. The Bayes-optimal solution is the forward algorithm:

$$b_t(j) \;\propto\; P(o_t \mid j) \sum_i T_{ij}\, b_{t-1}(i)$$

where $P(o_t \mid j)$ is the emission probability, $T$ is the transition matrix, and $b_{t-1}$ is the previous belief state. This requires multiplying the belief vector by the transition matrix at every step — a matrix-vector product that demands access to the full belief state, not just a running count.
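As a concrete illustration, here is one forward-algorithm step in Python. The two-state transition and emission matrices are invented for the example, not taken from the papers:

```python
import numpy as np

# Minimal belief-transport demo: propagate the belief through the
# dynamics (b @ T), reweight by the emission likelihood, renormalize.
T = np.array([[0.9, 0.1],    # transition matrix: T[i, j] = P(j | i)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],    # emission matrix: E[s, o] = P(obs o | state s)
              [0.1, 0.9]])

def forward_step(belief, obs):
    predicted = belief @ T            # sum_i b_{t-1}(i) * T[i, j]
    updated = predicted * E[:, obs]   # multiply by P(obs | state j)
    return updated / updated.sum()    # renormalize to a distribution

belief = np.array([0.5, 0.5])         # uniform prior over hidden states
for obs in [0, 0, 1, 1, 1]:
    belief = forward_step(belief, obs)
    print(f"obs={obs}  belief={np.round(belief, 3)}")
```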
Belief transport is strictly harder than accumulation. Architectures without sufficient representational capacity to store and manipulate the full belief vector (notably LSTMs) fail catastrophically here, while transformers and Mamba both succeed.
3. Random-Access Binding
The third primitive is content-based retrieval: given a set of stored key-value associations, retrieve the value corresponding to a query key. This is the computational backbone of associative recall, and it maps directly onto the query-key-value mechanism of transformer attention.
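To see how directly binding maps onto attention, here is a toy single-head retrieval in Python; the dimensions and random embeddings are illustrative assumptions, not the papers' setup:

```python
import numpy as np

# A single scaled-dot-product attention head doing content-based
# retrieval: store 8 key-value bindings, then probe with a noisy key.
rng = np.random.default_rng(0)
d = 64
keys = rng.standard_normal((8, d))
values = rng.standard_normal((8, d))

query = keys[3] + 0.1 * rng.standard_normal(d)   # noisy probe for item 3

scores = keys @ query / np.sqrt(d)    # content match, not position
attn = np.exp(scores - scores.max())
attn /= attn.sum()                    # softmax over stored keys
retrieved = attn @ values             # attention-weighted readout

cos = retrieved @ values[3] / (np.linalg.norm(retrieved) *
                               np.linalg.norm(values[3]))
print(f"most-attended item: {int(np.argmax(attn))}")   # -> 3
print(f"cosine(retrieved, values[3]) = {cos:.3f}")     # -> ~1.0
```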
Random-access binding is the hardest primitive, and it is the one that most sharply differentiates architectures. Transformers achieve 100% accuracy. Mamba manages 97.8% — respectable, but its sequential state-space mechanism introduces small errors that compound. LSTMs collapse to 0.5%, essentially random guessing. The reason is structural: binding requires a mechanism that can route information by content rather than by position, and only the QKV attention mechanism provides this natively.
The primitives taxonomy is not just descriptive — it is predictive. It tells you in advance which architectures will succeed on which tasks, based purely on whether they can realize the required primitives. Transformers are the only architecture that realizes all three, which is why they are the only architecture that matches the Bayes-optimal posterior across all four wind tunnels.
Results: Transformers Match Bayes with Sub-Bit Precision
The quantitative results are striking. Across all four wind tunnels, transformers match the Bayes-optimal posterior with errors measured in thousandths to hundredths of a bit:
Bijection Elimination
The transformer achieves a mean absolute error (MAE) of just a few thousandths of a bit against the optimal staircase entropy, with a KL divergence below 0.01 nats. The predicted entropy curve is visually indistinguishable from the Bayesian gold standard. Crucially, the model generalizes: it was never shown the optimal posterior during training, yet it converges to it.
HMM State Tracking
On the HMM task, the transformer achieves an MAE of 0.049 bits against the forward-algorithm posterior. More remarkably, the model generalizes to sequences 2.5 times longer than any it saw during training. This is strong evidence that the model has learned the algorithm (the forward recursion), not just the input-output map for a particular sequence length.
Associative Recall
This is the task where architecture matters most:
- Transformer: 100% accuracy. Perfect content-based binding.
- Mamba: 97.8%. Near-perfect, but the sequential scan introduces small compounding errors on long sequences.
- LSTM: 0.5%. Effectively random. The recurrent bottleneck makes content-based random-access retrieval impossible.
The MLP Control
An MLP (feed-forward only, no attention) fails uniformly across all tasks, with an MAE of 1.85 bits on the bijection task — three orders of magnitude worse than the transformer. This confirms that the performance is not coming from the feed-forward layers alone; the attention mechanism is essential.
A note on Mamba. Mamba actually outperforms the transformer on HMM transport (MAE 0.024 vs 0.049 bits). This makes architectural sense: its selective state-space mechanism is a natural fit for propagating beliefs through linear dynamics. Where Mamba falls short is on binding, where content-based random access is required. The primitives taxonomy captures this precisely.
The Three-Stage Mechanism
Matching the Bayesian posterior tells us that transformers do inference, but not how. Paper I's Section 5 performs a detailed mechanistic dissection, revealing a three-stage architecture that mirrors the structure of Bayes' rule itself.
Stage 1: Foundational Binding (Layer 0)
The first transformer layer constructs an orthogonal key basis over the hypothesis space. In the bijection task, this means creating a set of key vectors where each possible mapping gets its own approximately orthogonal direction. The empirical signature is a 37% reduction in off-diagonal cosine similarity compared to random initialization.
Remarkably, a single attention head — the “hypothesis-frame head” — is responsible for this. Ablating it increases error by over 10 times, while ablating other heads in the same layer has minimal impact. The model has learned to dedicate one head specifically to constructing the coordinate system in which the rest of the computation takes place.
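The orthogonality signature is straightforward to operationalize. Below is a minimal sketch of the diagnostic — mean absolute off-diagonal cosine similarity among key vectors — where `keys` stands in for a layer's key activations; the papers' exact metric may differ in detail:

```python
import numpy as np

def mean_offdiag_cosine(keys: np.ndarray) -> float:
    """Mean |cosine similarity| between distinct key vectors.
    Lower values indicate a more orthogonal hypothesis frame."""
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    cos = normed @ normed.T
    off = cos[~np.eye(len(cos), dtype=bool)]
    return float(np.abs(off).mean())

rng = np.random.default_rng(0)
random_keys = rng.standard_normal((32, 128))    # random-init baseline
print(f"random baseline: {mean_offdiag_cosine(random_keys):.4f}")
# A trained Layer 0 should score ~37% below this baseline (Paper I).
```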
Stage 2: Progressive Elimination (Middle Layers)
The middle layers implement the actual Bayesian update. At each layer, the query-key alignment sharpens: the attention patterns become progressively more selective, focusing on the hypotheses that remain consistent with the evidence. This is measurable as increasing QK alignment scores across depth.
Every middle layer is non-interchangeable: removing any single one increases the error by more than 10 times. This is in sharp contrast to the common finding in other contexts that transformer layers are somewhat redundant. Here, each layer is performing a distinct step in a sequential elimination algorithm.
A critical finding: it is the feed-forward network (FFN), not attention, that performs the actual posterior update computation. Attention's role is routing — determining which information from the context should be delivered to which position. The FFN then takes that routed information and executes the Bayesian update on the residual stream. This division of labor is consistent across all wind tunnels.
Stage 3: Precision Refinement (Late Layers)
The deepest layers exhibit the most surprising behavior: the value manifold unfurls. Early in the network, value vectors are clustered; by the final layers, they spread out along a smooth one-dimensional curve parameterized by posterior entropy. High-entropy states (maximum uncertainty) sit at one end of the curve; low-entropy states (near certainty) sit at the other.
This stage also reveals a phenomenon the authors call frame-precision dissociation: the attention maps (the “frame”) stabilize well before the values (the “precision”) finish refining. The routing pattern converges, but the values continue to adjust — the model keeps getting more precise even after it has decided where to look. This dissociation turns out to be a deep consequence of the gradient dynamics, as Paper II explains.
How Training Sculpts the Geometry
Paper I shows what the geometry looks like; Paper II explains how gradient descent creates it. The key insight is that the gradient of the cross-entropy loss with respect to the attention scores has a specific structure that naturally drives the system toward Bayesian geometry.
The Advantage-Based Routing Law
The gradient of the loss with respect to the pre-softmax attention score $s_{ij}$ (the score from query $i$ to key $j$) takes a remarkably clean form:

$$\frac{\partial \mathcal{L}}{\partial s_{ij}} \;=\; -\,a_{ij}\, A_{ij}, \qquad A_{ij} \;=\; -\,\delta_i^\top \bigl(v_j - o_i\bigr)$$

where $a_{ij}$ is the current attention weight, $o_i = \sum_k a_{ik} v_k$ is the current attention output, $\delta_i = \partial \mathcal{L} / \partial o_i$ is the upstream error, and $A_{ij}$ is the "advantage" — the benefit of attending to key $j$ relative to the current weighted average. Unpacking this: the gradient pushes attention toward keys whose values reduce the loss (positive advantage) and away from keys whose values increase it (negative advantage). But it does so multiplicatively: keys that already receive little attention (small $a_{ij}$) receive proportionally small gradient signals. This is a soft winner-take-all dynamic.
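This identity is standard softmax-attention calculus and can be checked numerically. The sketch below verifies $\partial\mathcal{L}/\partial s_{ij} = a_{ij}\,\delta_i^\top(v_j - o_i)$ with autograd; the squared-error loss and the shapes are arbitrary illustrative choices:

```python
import torch

# Numerical check of the routing law: dL/ds_ij = a_ij * delta_i.(v_j - o_i).
torch.manual_seed(0)
n, d = 5, 4
s = torch.randn(n, n, requires_grad=True)   # pre-softmax attention scores
V = torch.randn(n, d)                       # value vectors
target = torch.randn(n, d)

a = torch.softmax(s, dim=-1)                # attention weights
o = a @ V                                   # attention output
loss = 0.5 * ((o - target) ** 2).sum()
loss.backward()

delta = (o - target).detach()               # upstream error dL/do_i
closed = a.detach() * (delta @ V.T
                       - (delta * o.detach()).sum(-1, keepdim=True))
print("matches autograd:", torch.allclose(s.grad, closed, atol=1e-5))
```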
The Value Update
Simultaneously, the value vectors are updated by the rule:

$$\Delta v_j \;=\; -\,\eta \sum_i a_{ij}\, \delta_i$$

where $\delta_i$ is the upstream error signal (the gradient of the loss with respect to the attention output at position $i$) and $\eta$ is the learning rate. Each value vector moves in the direction that reduces the weighted loss, with the weighting determined by how much attention it currently receives. Values that are attended to heavily get updated more aggressively.
The EM Analogy
Paper II draws a striking connection to the Expectation-Maximization algorithm. The two gradient equations form a coupled feedback loop that mirrors EM:
- E-step (attention weights): the attention weights act as soft responsibilities, assigning each query to keys based on the current value vectors. This is the “expectation” — given the current parameters, what is the best routing?
- M-step (value updates): the value vectors update as responsibility-weighted prototypes, moving to better represent the targets associated with the queries that attend to them. This is the “maximization” — given the current routing, what are the best values?
The E-step and M-step are not executed sequentially (as in classical EM) but are coupled through the gradient: each step influences the other at every training iteration. This coupled feedback loop is what creates the Bayesian geometry. The routing pushes attention toward hypothesis-consistent keys; the value update pushes values toward posterior-consistent representations; and the two together converge to the Bayesian manifold.
Frame-Precision Dissociation Explained
The EM analogy also explains why the attention frame stabilizes before the values. The score gradient is proportional to $a_{ij}\,A_{ij}$. Once the routing is approximately correct, the advantages become approximately equal for all attended keys (the model is attending to the right places), so the advantage signal approaches zero. The frame freezes.
But the value gradient is proportional to $\sum_i a_{ij}\,\delta_i$. As long as the upstream error $\delta_i$ is nonzero — i.e., as long as the model's output is not yet perfect — values keep updating. The frame is a fixed point of the routing dynamics; the values are a fixed point of the full system. The frame converges first because its convergence condition is weaker.
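A toy simulation makes the dissociation visible. Training the scores and values of a single attention layer by gradient descent implements exactly the two coupled updates above; the clustering-style regression task below is an invented stand-in for the wind tunnels. In typical runs, the frame drift collapses while the value gradient is still substantial:

```python
import torch

# Coupled EM-like dynamics: soft routing (scores s) and prototypes (V)
# trained jointly. Watch the attention "frame" freeze before the values.
torch.manual_seed(0)
n, m, d = 12, 3, 8
targets = torch.randn(m, d)[torch.arange(n) % m]  # 3 target prototypes
s = 0.01 * torch.randn(n, m, requires_grad=True)  # near-uniform routing
V = torch.randn(m, d, requires_grad=True)

opt = torch.optim.SGD([s, V], lr=0.2)
prev_a = torch.softmax(s, -1).detach()
for step in range(601):
    a = torch.softmax(s, -1)
    loss = 0.5 * ((a @ V - targets) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 150 == 0:
        a_now = torch.softmax(s, -1).detach()
        print(f"step {step:3d}  loss={loss.item():8.3f}  "
              f"frame drift={torch.norm(a_now - prev_a).item():.4f}  "
              f"value grad={torch.norm(V.grad).item():.4f}")
        prev_a = a_now
```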
The Content-Based Value Routing Conjecture
In Paper II's Section 9, the authors propose a unifying conjecture: the EM feedback loop is not specific to the transformer's QKV mechanism. Any architecture that implements content-based value routing — i.e., that decides which stored values to retrieve based on the content of the current input, not its position — will develop Bayesian geometry under cross-entropy training. This suggests that the Bayesian structure is not an accident of the transformer architecture, but a consequence of the objective function interacting with any sufficiently expressive routing mechanism. The conjecture would explain why Mamba, despite using a completely different routing mechanism (selective state spaces rather than QKV attention), also develops some of the same geometric signatures.
Does It Scale?
Papers I and II work with small, controlled models trained on wind-tunnel tasks. The obvious question: does any of this survive in production-scale language models trained on internet text? Paper III answers this affirmatively, but with important caveats.
Models Tested
Paper III probes four publicly available models spanning more than an order of magnitude in parameters: Pythia-410M, Phi-2 (2.7B), Llama-3.2-1B, and the Mistral-7B family. These cover a range of training regimes, data mixes, and architectural choices (full multi-head attention, grouped-query attention, sliding-window attention).
The Domain Restriction Bridge
The first challenge is that production models process mixed-domain text, not controlled wind-tunnel inputs. On mixed-domain data, the Bayesian geometric signatures are present but variable — the fraction of value-manifold variance explained by the first two principal components (PC1+PC2) ranges from 15% to 100% depending on the layer and domain.
The breakthrough insight is domain restriction: when inputs are filtered to a single cognitive domain (specifically, mathematical reasoning), the geometric signatures collapse to near-universality. On math-only inputs, PC1+PC2 explains 70–95% of value-manifold variance across all four models. The value manifold is approximately one-dimensional, just as in the wind tunnels.
This makes intuitive sense. A language model processing mixed-domain text must maintain multiple overlapping hypothesis spaces simultaneously. When restricted to a single domain, the model can focus its representational capacity on a single inference problem, and the clean Bayesian geometry emerges.
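The PC1+PC2 statistic itself is simple to compute. A minimal sketch, where `acts` stands in for value-stream activations collected on domain-restricted inputs (the synthetic data below just exercises the metric on a near-one-dimensional manifold):

```python
import numpy as np

def pc12_fraction(acts: np.ndarray) -> float:
    """Fraction of variance captured by the top two principal components."""
    centered = acts - acts.mean(axis=0)
    svals = np.linalg.svd(centered, compute_uv=False)
    var = svals ** 2                 # squared singular values ~ PC variances
    return float(var[:2].sum() / var.sum())

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 200)[:, None]              # 1-D latent parameter
acts = np.hstack([t, t ** 2]) @ rng.standard_normal((2, 64))
acts += 0.05 * rng.standard_normal(acts.shape)   # small off-manifold noise
print(f"PC1+PC2 variance fraction: {pc12_fraction(acts):.2f}")  # near 1.0
```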
The SULA Experiment
The most compelling evidence comes from the SULA (Sequential Update with Latent Adaptation) experiment. The authors construct in-context learning tasks with known Beta-Bernoulli posteriors and feed them to production models as natural-language prompts. As evidence accumulates in the context, the models' internal representations move along the entropy-aligned manifold — exactly as predicted by the wind-tunnel theory.
The MAE between the model's implicit posterior (decoded from its internal representations) and the Bayes-optimal posterior ranges from 0.31 to 0.44 bits across models. This is not sub-bit precision like the wind tunnels, but it is remarkably close for models that were never trained on these tasks.
Crucially, the controls work: shuffling the labels destroys the correlation between internal representations and the Bayesian posterior, as does ablating the evidence from the context. The models are not memorizing a surface pattern; they are performing genuine in-context inference.
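For reference, the Bayes-optimal side of such a comparison is easy to reproduce. A sketch of the Beta-Bernoulli update, assuming a uniform Beta(1, 1) prior (the papers' exact prompt construction and prior are not shown here):

```python
import math

# Beta-Bernoulli posterior, updated one binary observation at a time.
alpha, beta = 1.0, 1.0                  # uniform prior (assumption)
for obs in [1, 1, 0, 1, 1, 1, 0, 1]:
    alpha += obs
    beta += 1 - obs
    p = alpha / (alpha + beta)          # posterior predictive P(next = 1)
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    print(f"obs={obs}  P(1)={p:.3f}  predictive entropy={h:.3f} bits")
```

An implicit posterior decoded from a model's internal representations can then be scored in bits of MAE against this exact reference.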
Static vs. Dynamic Signatures
Paper III identifies a clean split between two types of Bayesian geometric signatures:
- Static signatures (value manifolds and key orthogonality) are universal. They appear in every model tested, regardless of architecture, scale, or training data. These are structural properties of the weight matrices, baked in by training.
- Dynamic signatures (attention focusing — the sharpening of attention patterns during inference) depend on the routing architecture. Models with full-sequence multi-head attention (Pythia, Phi-2) show strong dynamic focusing (80–90% entropy reduction). Models with grouped-query attention or sliding-window attention (Llama, Mistral) show weaker dynamic focusing (20–30% entropy reduction).
The static-dynamic split has a clear architectural explanation: GQA and sliding-window attention sacrifice per-head routing flexibility for memory efficiency. The value geometry is unaffected (it lives in the value weight matrices, not the attention patterns), but the attention's ability to sharply focus on specific keys is reduced.
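One way to quantify dynamic focusing, sketched below with normalized attention entropy at early versus late context positions; the synthetic attention pattern is an invented illustration, not model data:

```python
import numpy as np

def normalized_entropy(row: np.ndarray) -> float:
    """Entropy of an attention row, normalized by log2 of its support."""
    p = row[row > 0]
    return float(-(p * np.log2(p)).sum() / np.log2(len(p))) if len(p) > 1 else 1.0

rng = np.random.default_rng(0)
T = 32
scores = rng.standard_normal((T, T))
attn = np.zeros((T, T))
for i in range(1, T):
    temp = 2.0 / (1 + i / 4)            # sharper routing as context grows
    row = np.exp(scores[i, : i + 1] / temp)
    attn[i, : i + 1] = row / row.sum()

early, late = normalized_entropy(attn[2]), normalized_entropy(attn[-1])
print(f"early={early:.2f}  late={late:.2f}  "
      f"reduction={100 * (early - late) / early:.0f}%")
```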
Causal Intervention
Perhaps the most thought-provoking result in Paper III: the authors perform a causal intervention, ablating the entropy axis of the value manifold (projecting out the principal component most aligned with posterior entropy). This destroys the geometric structure — PC1+PC2 drops dramatically — but it does not destroy the model's calibration. The model still produces well-calibrated predictions.
This means the value manifold is a privileged readout of the model's internal Bayesian computation, not a computational bottleneck. The Bayesian inference happens in the residual stream; the value manifold is how we can observe it, but it is not the mechanism itself. The model's inference survives the ablation because the computation is distributed across the full residual stream, not concentrated in the value manifold.
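A minimal version of this intervention, where `acts` stands in for collected activations and `entropy` for the Bayes posterior entropy per token; the axis-selection heuristic (the PC most correlated with entropy) is an illustrative assumption:

```python
import numpy as np

def ablate_entropy_axis(acts: np.ndarray, entropy: np.ndarray) -> np.ndarray:
    """Project out the principal axis most aligned with posterior entropy."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt.T
    corrs = [abs(np.corrcoef(proj[:, k], entropy)[0, 1])
             for k in range(vt.shape[0])]
    axis = vt[int(np.argmax(corrs))]
    return acts - np.outer(acts @ axis, axis)   # remove that direction

# Synthetic check: activations that encode entropy along one direction.
rng = np.random.default_rng(0)
entropy = np.linspace(1.0, 0.0, 100)
axis_true = rng.standard_normal(32)
axis_true /= np.linalg.norm(axis_true)
acts = np.outer(entropy, axis_true) + 0.05 * rng.standard_normal((100, 32))

ablated = ablate_entropy_axis(acts, entropy)
print("variance along entropy axis:",
      f"before={float((acts @ axis_true).var()):.4f}",
      f"after={float((ablated @ axis_true).var()):.4f}")
```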
The Unified Picture
The trilogy's three papers can be understood as three lemmas in a single structural theorem about transformer inference:
- Paper I (Lemma 1: Existence) — WHICH architectures implement Bayes.
- Paper II (Lemma 2: Mechanism) — HOW gradient descent sculpts the geometry.
- Paper III (Lemma 3: Scaling) — WHERE the geometry persists at scale.
The structural theorem, restated informally: Any architecture capable of content-based value routing, trained with cross-entropy on tasks requiring sequential evidence integration, will develop internal geometric structures — orthogonal key frames, entropy-parameterized value manifolds, and progressive attention sharpening — that implement Bayesian posterior inference.
The three lemmas are tightly interlocked. Lemma 1 establishes the phenomenon (transformers match Bayes) and characterizes the internal geometry. Lemma 2 derives the mechanism (the EM-like coupled gradient loop) and explains why cross-entropy training necessarily produces this geometry. Lemma 3 validates the theory at scale, confirming that the geometric signatures are not an artifact of toy-scale training but a fundamental property of how these models organize information.
The chain of reasoning is: the cross-entropy objective creates an advantage-based routing gradient (Lemma 2) that drives attention toward hypothesis-consistent keys. This routing gradient, coupled with the responsibility-weighted value update, creates an EM-like loop that converges to Bayesian geometry (Lemma 2). The resulting geometry consists of orthogonal key frames, entropy-parameterized value manifolds, and progressive attention sharpening (Lemma 1). These structures are present in production models, with the static components (key orthogonality, value manifolds) being universal and the dynamic component (attention focusing) depending on the routing architecture (Lemma 3).
Implications
If the trilogy's thesis holds up to further scrutiny, the implications are significant for both theory and practice.
Architecture Selection Principle
The primitives taxonomy provides a principled criterion for choosing architectures. If your task requires all three inference primitives (accumulation, transport, and binding), you need a transformer or something with equivalent content-based routing capability. If your task only requires accumulation and transport (e.g., time-series forecasting with known dynamics), a state-space model like Mamba may be sufficient — and more efficient. The primitives taxonomy transforms architecture selection from an empirical guessing game into a systematic analysis of task requirements.
A Lower Bound for LLM Reasoning
The trilogy establishes that transformers can perform exact Bayesian inference when the task structure permits it. This sets a lower bound on the reasoning capability of transformer-based LLMs: they are at least as powerful as Bayesian inference, on tasks where Bayesian inference is the right framework. This does not mean LLMs are always Bayesian — on out-of-distribution inputs or tasks that do not map cleanly onto Bayesian structure, they may use entirely different strategies. But it does mean that dismissing LLM reasoning as “mere pattern matching” is provably too reductive.
Testable Predictions for Frontier Models
The theory generates concrete, falsifiable predictions. For any new model:
- Its value manifold on domain-restricted inputs should be approximately one-dimensional, parameterized by posterior entropy.
- Its key matrices should exhibit above-chance orthogonality between semantically distinct hypothesis classes.
- The frame-precision dissociation should hold: attention patterns should stabilize before value representations do during training.
- GQA/sliding-window models should show weaker dynamic focusing than full-MHA models, with no difference in static signatures.
These predictions can be tested on any new model release. If they consistently hold, the theory gains credibility; if they fail for specific architectures, we learn something about the boundaries of the structural theorem.
Training Data Quality Matters
An underappreciated implication: the Bayesian geometry emerges because the training data contains genuine inferential structure — patterns where evidence accumulates, hypotheses are eliminated, and posteriors sharpen. If training data is low-quality (random, contradictory, or insufficiently structured), the EM-like gradient loop has nothing to latch onto and the geometry will not form. This provides a geometric explanation for the empirical finding that training data quality matters at least as much as quantity: high-quality data provides the inferential scaffolding that the gradient dynamics require to sculpt Bayesian manifolds.
Further reading
Agarwal, Dalal & Misra (2025) — The Bayesian Geometry of Transformer Attention (arXiv 2512.22471)
Agarwal, Dalal & Misra (2025) — Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds (arXiv 2512.22473)
Agarwal, Dalal & Misra (2026) — Geometric Scaling of Bayesian Inference in LLMs (arXiv 2512.23752)