How a Transformer Learns Addition
Building every piece, then breaking it
Preface
This book builds a transformer from scratch using the simplest possible task: three-digit addition. The architecture is identical to GPT, Llama, and Claude — just smaller. Every component you build here scales directly to the models that generate human-like text.
Why addition? Natural language hides the transformer’s mechanics behind the complexity of semantics. Addition strips that away. The task is fully specified, the algorithm is known (grade-school column addition with carries), and the models are small enough to inspect every weight.
Why transformers?
What problem do they solve? Sequential data — text, time series, code — where each element’s meaning depends on context. Before transformers, recurrent networks (RNNs, LSTMs) processed one step at a time, compressing the entire history into a fixed-size hidden state. That bottleneck limits what the model can remember. Transformers process all positions in parallel via attention: every position can directly access every other position, with no compression.
Why did they take over? Two properties, together: (1) attention removes the sequential bottleneck, so long-range dependencies are as cheap as short-range ones, and (2) the entire computation is parallelizable on GPUs. These made scaling to billions of parameters feasible. RNNs are inherently sequential — you can’t compute step \(t\) until step \(t-1\) finishes. Transformers compute all steps at once.
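The contrast can be made concrete in a few lines of numpy. This is an illustrative sketch, not the book's code — shapes are arbitrary, projection matrices are omitted, and the random weights stand in for learned parameters:

```python
import numpy as np

# Sketch: the RNN's sequential loop vs. attention's single parallel
# matrix product. All names and shapes here are illustrative.
rng = np.random.default_rng(0)
T, d = 8, 16                        # sequence length, model dimension
x = rng.standard_normal((T, d))     # one sequence of T token vectors

# RNN: step t cannot begin until step t-1 has finished.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for t in range(T):                  # inherently sequential
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Attention: all pairwise interactions in one shot (projections omitted).
scores = x @ x.T / np.sqrt(d)       # (T, T) — every position vs. every other
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
out = weights @ x                   # (T, d), computed for all positions at once
```

The RNN loop is a chain of dependent steps; the attention computation is two matrix multiplications and a softmax, which a GPU executes for every position simultaneously.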
Statistical framing. A transformer is a parameterized function family \(f_\theta: \mathcal{X} \to \mathcal{Y}\). The architecture defines which functions are representable (the hypothesis class); gradient descent plus data selects one. This is regression with a very expressive, learned basis — “nonparametric” in spirit, because adding layers and heads enlarges the hypothesis class, and the number of effective parameters adapts to the data.
Why should a statistician care? Transformers are now components in statistical pipelines — synthetic data generation, missing data imputation, feature extraction for downstream models. Understanding the internals tells you when to trust the outputs, the same way knowing the assumptions behind a Bayesian model tells you when the posterior is meaningful. A transformer that gets the right answer for the wrong reason (memorization vs. algorithm) will fail silently on out-of-distribution inputs. This book teaches you to tell the difference.
What you’ll build:
- A tokenizer that encodes `123+456=` as a sequence of integers
- Token and position embeddings
- Self-attention (by hand, then as a module)
- Feed-forward networks, residual connections, layer normalization
- A complete decoder-only transformer
- Training on 50,000 addition problems
- Visualizations of what the model learns internally
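The first item in this list can be sketched in a few lines. The vocabulary layout and function names below are illustrative guesses, not the book's actual tokenizer:

```python
# Hypothetical minimal tokenizer for addition strings such as "123+456=".
# Vocabulary: digits 0-9 plus the symbols '+' and '=' (12 tokens total).
VOCAB = {ch: i for i, ch in enumerate("0123456789+=")}

def encode(s: str) -> list[int]:
    """Map each character to its integer token id."""
    return [VOCAB[ch] for ch in s]

def decode(ids: list[int]) -> str:
    """Invert encode: map token ids back to characters."""
    inv = {i: ch for ch, i in VOCAB.items()}
    return "".join(inv[i] for i in ids)

print(encode("123+456="))   # [1, 2, 3, 10, 4, 5, 6, 11]
```

Character-level tokenization is all the task needs: every input is drawn from a 12-symbol alphabet, so there is no vocabulary-design problem to solve.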
What you’ll learn:
- Representation choices (how you encode the problem) matter more than architecture
- Attention patterns are interpretable — the model learns column alignment
- Each component is independently necessary (you’ll prove this by removing them)
- The same architecture learns completely different algorithms for different tasks
- The gap between this 17,000-parameter model and GPT-4 is scale, not architecture
What you should take away from this book
These five lessons generalize far beyond addition:
Representation choices are statistical parameterization. Reversing the output digits transforms a nearly impossible learning problem into a trivial one. Same model, same data, same optimizer. The encoding is a modeling decision with the same status as choosing a link function in a GLM or a prior in a Bayesian model. When you use transformers in practice, how you tokenize and format input is the highest-leverage design choice.
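The reversal trick is simple to state in code. The formatting function below is a hypothetical sketch, not the book's implementation:

```python
# Illustrative only: encode the target sum with its digits reversed, so
# the model emits the ones digit first — the digit the carry chain
# starts from, which makes each output digit depend only on tokens
# already generated.
def format_example(a: int, b: int, reverse: bool = True) -> str:
    s = str(a + b).zfill(4)        # pad to 4 digits (max 999+999=1998)
    if reverse:
        s = s[::-1]                # ones digit first
    return f"{a}+{b}={s}"

print(format_example(123, 456))                 # 123+456=9750
print(format_example(123, 456, reverse=False))  # 123+456=0579
```

Same data, two encodings: the reversed form lets a left-to-right generator compute carries in the order the algorithm needs them.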
Attention is learned, dynamic feature selection. A linear regression decides once which features matter. Attention re-decides for every input — the ones-digit output attends to ones-column inputs, while the tens-digit output attends to tens-column inputs, all using the same weight matrices. This is directly inspectable, which makes transformers more interpretable than most deep learning architectures.
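The "re-decides for every input" point can be checked directly: with the weight matrices held fixed, different inputs produce different attention patterns. This is a minimal sketch with random stand-in weights, not the trained model:

```python
import numpy as np

# Sketch: the same fixed parameter matrices yield input-dependent
# attention weights — unlike a linear model's fixed coefficients.
rng = np.random.default_rng(1)
d = 8
W_q = rng.standard_normal((d, d))   # fixed "learned" query projection
W_k = rng.standard_normal((d, d))   # fixed "learned" key projection

def attention_weights(x):
    """Row-wise softmax over scaled query-key scores."""
    scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x1 = rng.standard_normal((5, d))
x2 = rng.standard_normal((5, d))
# Same parameters, different inputs, different "feature selection":
print(np.allclose(attention_weights(x1), attention_weights(x2)))  # False
```

A linear regression applies one fixed set of coefficients to every input; here the mixing weights are recomputed from the input itself, which is exactly what makes them inspectable per example.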
Each component is necessary, and ablation proves it. Removing positional embeddings, the causal mask, or a layer each breaks the model in a specific, predictable way. If you can predict the failure mode before running the ablation, you understand the component. This is the standard for understanding any model you use in practice.
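One of these failure modes can be predicted from first principles. Without positional embeddings, self-attention (shown here without the causal mask, as a simplification) is permutation-equivariant, so token order carries no signal — a numpy sketch, not the book's model:

```python
import numpy as np

# Sketch: why removing positional embeddings breaks addition.
# Without positions, unmasked self-attention is permutation-equivariant:
# shuffling the input rows just shuffles the output rows, so the model
# cannot tell which digit sits in which column.
rng = np.random.default_rng(2)
T, d = 6, 8
x = rng.standard_normal((T, d))    # token embeddings only, no positions

def self_attn(x):
    s = x @ x.T / np.sqrt(x.shape[1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ x

perm = rng.permutation(T)
out = self_attn(x)
out_perm = self_attn(x[perm])
print(np.allclose(out_perm, out[perm]))   # True — order carries no signal
```

That is a predicted failure mode in the sense the paragraph describes: you can state it before running the ablation, then confirm it empirically.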
Same architecture, different data produces different learned programs. The transformer is a hypothesis class, not a fixed algorithm. Addition data produces column-alignment attention; reversal data produces anti-diagonal attention. The architecture defines what’s possible; the data selects what’s actual. This is why the same GPT architecture can write poetry, solve math, and generate code.
Scale is the only difference from GPT-4. Every component in this 17,000-parameter model — token embeddings, positional encodings, causal self-attention, feed-forward networks, residual connections, layer normalization — appears identically in models with billions of parameters. The architecture is the same; the numbers are bigger.
How to use this book
Read the chapters in order. Each one builds on the previous. The code executes top-to-bottom; all shared infrastructure lives in _common.py.
If you want to run the code interactively, the .qmd files work as Quarto notebooks or can be converted to Jupyter.
Prerequisites
- Python, PyTorch, matplotlib, numpy, scikit-learn
- Basic linear algebra (matrix multiplication, dot products)
- No prior neural network experience required — the book introduces gradient descent, loss functions, and backpropagation as needed