A series in three parts

Backpropagation, from first principles

Three companion tutorials that build the math of neural-network training from the ground up — no prerequisites past high-school algebra.

Most explanations of backpropagation jump straight to the algorithm and leave the calculus implicit. This series goes the other way: it builds the small pieces of math you need first — slopes, derivatives, the chain rule — and then shows that backpropagation is nothing more than those pieces, applied to the layers of a network.
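As a rough preview of that claim, here is a minimal sketch in plain Python. It is not taken from the tutorials: the one-input, one-hidden-unit, one-output network, the squared-error loss, and every name in it (`forward`, `gradients`, `numeric_gradients`) are assumptions chosen for illustration. It computes both weight gradients using only the chain rule, then compares them with finite-difference slopes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Forward pass of a hypothetical 1-1-1 network: x -> h -> y."""
    h = sigmoid(w1 * x)   # hidden activation
    y = sigmoid(w2 * h)   # output activation
    return h, y


def loss(x, t, w1, w2):
    """Squared-error loss (an assumed choice) against target t."""
    _, y = forward(x, w1, w2)
    return 0.5 * (y - t) ** 2


def gradients(x, t, w1, w2):
    """Chain rule only: multiply the local derivatives along each path."""
    h, y = forward(x, w1, w2)
    dL_dy = y - t                 # derivative of 0.5 * (y - t)^2
    dy_dz2 = y * (1.0 - y)        # sigmoid derivative at the output
    dh_dz1 = h * (1.0 - h)        # sigmoid derivative at the hidden unit
    dL_dw2 = dL_dy * dy_dz2 * h
    dL_dw1 = dL_dy * dy_dz2 * w2 * dh_dz1 * x
    return dL_dw1, dL_dw2


def numeric_gradients(x, t, w1, w2, eps=1e-6):
    """Central finite differences: the slope idea, applied to each weight."""
    g1 = (loss(x, t, w1 + eps, w2) - loss(x, t, w1 - eps, w2)) / (2 * eps)
    g2 = (loss(x, t, w1, w2 + eps) - loss(x, t, w1, w2 - eps)) / (2 * eps)
    return g1, g2


# Arbitrary illustrative values, not numbers from the series.
x, t, w1, w2 = 0.5, 1.0, 0.8, -0.4
analytic = gradients(x, t, w1, w2)
numeric = numeric_gradients(x, t, w1, w2)
print("chain rule:        ", analytic)
print("finite differences:", numeric)
```

The two printed pairs should agree to many decimal places. That agreement is the sense in which a backward pass is just slopes, derivatives, and the chain rule.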

Each tutorial is self-contained, with interactive D3 figures and worked examples, and takes 30–45 minutes to read. By the end of the third one, you will have computed a full backward pass by hand and understood why every formula has the shape it does.

The three parts

Derivatives, taught slowly

Slope of a straight line, tangent to a curve, the limit definition. From there to the power rule, the exponential, and the derivative of the sigmoid, a classic neural-network activation. Closes with partial derivatives and the gradient vector.

16 sections · 4 interactive figures · ~35 min

read →

The chain rule, taught slowly

Why rates of change multiply across a composition of functions. Four worked examples — including the full derivation of σ′(x) as a three-stage chain — plus the multivariate version that backpropagation actually uses: multiply along paths, sum across paths.

15 sections · 4 interactive figures · ~35 min

read →
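For a taste of the three-stage chain mentioned in Part 2, one possible decomposition of the sigmoid is sketched below. The intermediate names u and v, and the choice of stages, are assumptions made here for illustration; the tutorial may split the function differently.

```latex
% Assumed stages for \sigma(x) = 1 / (1 + e^{-x}):
%   u = -x,   v = 1 + e^{u},   \sigma = 1 / v
\begin{align*}
\frac{d\sigma}{dx}
  &= \frac{d\sigma}{dv}\cdot\frac{dv}{du}\cdot\frac{du}{dx} \\
  &= \left(-\frac{1}{v^{2}}\right)\cdot e^{u}\cdot(-1) \\
  &= \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \\
  &= \sigma(x)\,\bigl(1 - \sigma(x)\bigr)
\end{align*}
```

The last step uses the identity 1 − σ(x) = e^{−x} / (1 + e^{−x}).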

Backpropagation, taught slowly

A six-weight network traced end-to-end: forward pass, loss, output δ, hidden δ, six weight updates, every number shown. Why δ exists, what vanishing gradients are, how bias terms slot in, and the vectorized form that ships in PyTorch.

19 sections · 4 interactive figures · ~45 min

read →

How to read this

If you already know what a derivative is and what the chain rule does, you can jump straight to Part 3 (Backpropagation). If either of those is rusty, start at the beginning: each part takes the previous part as its only prerequisite, and the cross-references are tight.

total reading time about two hours · all three pages use the same vocabulary and notation