The AGI Scientist
Primer · The AGI Scientist · June 20, 2026 · 12 min read

Transformers from first principles

What attention actually computes, built up from nothing — no prior deep-learning background assumed.

If you can multiply matrices and you're willing to think carefully, you can understand a transformer. This primer builds one up from the ground, skipping the jargon until you've seen the machinery.

The one idea: attention

At its core, attention is a weighted lookup. Each token asks a question (a query), every token offers a label (a key), and the match between a query and each key, scored with a dot product and normalized by a softmax so the weights sum to one, decides how much of each token's value gets mixed into the answer. That's it; the rest is bookkeeping.
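
To make the weighted lookup concrete, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is an illustration under stated assumptions, not a reference implementation. The function names (softmax, dot, attention) and the toy vectors are made up for this example. The queries, keys, and values are taken as given here, whereas a real transformer would produce them from token embeddings with learned projection matrices. Dividing the scores by the square root of the key dimension is the usual scaling convention and is assumed rather than taken from the text above.

```python
import math


def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """For each query, return a weighted mix of the value vectors."""
    scale = math.sqrt(len(keys[0]))  # assumed convention: sqrt of key dimension
    outputs = []
    for q in queries:
        # 1. Score this query against every key.
        scores = [dot(q, k) / scale for k in keys]
        # 2. Normalize the scores into mixing weights.
        weights = softmax(scores)
        # 3. Mix the values, one output coordinate at a time.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Toy example (hypothetical numbers): three tokens, 2-dimensional vectors.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

for row in attention(queries, keys, values):
    print([round(x, 2) for x in row])
```

Each output row is a blend of all three value vectors, tilted toward the values whose keys best match that row's query. The explicit loops are for readability; in practice the same three steps are written as a handful of matrix multiplications.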

Why it works

  • It's content-addressed. Tokens attend to what's relevant, not just to what's nearby. Long-range structure becomes reachable in one step.
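  • It's parallel. Unlike a recurrent network, which must step through a sequence one token at a time, attention computes every position at once with a few large matrix multiplications, which is why transformers train efficiently on modern hardware.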
  • It composes. Alternate attention with simple feed-forward layers and repeat, and each layer builds on the representations produced by the one below it.
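
The "stack and repeat" idea can be sketched in a few dozen lines of plain Python. This is a rough illustration, not a faithful transformer: the weights are random rather than trained, the names (self_attention, feed_forward, block, make_params) and the sizes are hypothetical, and layer normalization and positional information are left out. The residual additions (adding each sublayer's output back onto its input) are standard practice but are an assumption here, since the text above does not describe them.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Every token queries every token, then mixes their values."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


def feed_forward(token, w_in, w_out):
    """Two-layer network with a ReLU, applied to one token on its own."""
    hidden = [max(0.0, h) for h in matvec(w_in, token)]
    return matvec(w_out, hidden)


def block(tokens, params):
    """One layer: attention, then feed-forward, each added to its input."""
    attended = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    tokens = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    updated = []
    for tok in tokens:
        ff = feed_forward(tok, params["w_in"], params["w_out"])
        updated.append([t + f for t, f in zip(tok, ff)])
    return updated


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained ones."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(dim, hidden, rng):
    return {
        "w_q": random_matrix(dim, dim, rng),
        "w_k": random_matrix(dim, dim, rng),
        "w_v": random_matrix(dim, dim, rng),
        "w_in": random_matrix(hidden, dim, rng),
        "w_out": random_matrix(dim, hidden, rng),
    }


rng = random.Random(0)
dim, hidden, depth = 4, 8, 3  # hypothetical toy sizes
layers = [make_params(dim, hidden, rng) for _ in range(depth)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]

# Composition: the output of each layer is the input to the next.
for params in layers:
    tokens = block(tokens, params)

print(len(tokens), len(tokens[0]))
```

Because every block maps a list of vectors to a list of vectors of the same shape, blocks can be stacked to any depth without extra plumbing, which is what makes the design compose.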

What to read next

Once the mechanics click, the interesting questions are mechanistic: which heads do what, and how do circuits form? That's where interpretability begins — see the open experiments in the research feed.