Backpropagation, taught slowly
Every formula in the algorithm exists because of the chain rule. Once you see that, the Greek letters and indices stop being scary.
01 What we're doing
Backpropagation is the procedure that, given a neural network and a single training example, tells you for each weight in the network which direction to nudge it so the network's prediction becomes a little less wrong. That's the whole show. Everything that follows is the mechanics of how that nudging instruction gets computed.
The trick that makes it work — and the reason it has its name — is that the algorithm propagates information backward from the output (where we know how wrong we are) through the network (figuring out who's responsible for the wrongness). That backward flow is the chain rule from calculus, applied to a function that happens to be a composition of layers.
If you remember what a derivative is and what the chain rule does, you'll come out of this tutorial understanding why every step in backpropagation exists — not just how to apply the formulas. The math is high-school calculus; the hard part is the indices and notation.
02 The network
Here is the network we'll use throughout. Two inputs, two hidden neurons, one output. Six weights total. No biases (we'll add them later).
| Symbol | Meaning | Value |
|---|---|---|
| x₁, x₂ | inputs (the data) | 0.35, 0.9 |
| W₁₃ | weight from x₁ into H₃ | 0.1 |
| W₂₃ | weight from x₂ into H₃ | 0.8 |
| W₁₄ | weight from x₁ into H₄ | 0.4 |
| W₂₄ | weight from x₂ into H₄ | 0.6 |
| W₃₅ | weight from H₃ into O₅ | 0.3 |
| W₄₅ | weight from H₄ into O₅ | 0.9 |
| t | target (what we want O₅ to predict) | 0.5 |
| η | learning rate (step size) | 1.0 |
The subscript convention here is source–destination: $W_{13}$ means "from unit 1 to unit 3." Some textbooks reverse this. Mathematically identical, but you have to pick one and stick with it. Whenever you encounter someone else's notation, look at a single concrete example and figure out which direction they meant.
03 Anatomy of a neuron
Each neuron in the network does exactly two things, in order: a weighted sum of its inputs, then a squashing function applied to that sum. That's it.
Step A: weighted sum. Take all inputs, multiply each by its corresponding weight, and add them up. We call that sum $a$ (the "pre-activation"):

$$a_j = \sum_i W_{ij}\, o_i$$
where $o_i$ is the output of upstream unit $i$. For the first layer, $o_i$ is just the raw input $x_i$.
Step B: squash. Put $a$ through the sigmoid function, which compresses any real number into the open interval $(0, 1)$:

$$y = \sigma(a) = \frac{1}{1 + e^{-a}}$$
The sigmoid curve is the central visual object of this entire tutorial. It is nearly flat for large negative and large positive inputs and steepest at $a = 0$, where the output is $0.5$ and the derivative peaks at $0.25$.
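To get a feel for the curve without a plot, here is a minimal sketch that tabulates the sigmoid and its derivative at a few inputs. The sample inputs are arbitrary, and the derivative is written as $y(1-y)$, a formula the tutorial derives later in section 08.

```python
import math

def sigmoid(a: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a: float) -> float:
    """Derivative of the sigmoid, written in terms of its own output."""
    y = sigmoid(a)
    return y * (1.0 - y)

# Sample inputs chosen for illustration only.
for a in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"a = {a:+.1f}   sigmoid = {sigmoid(a):.4f}   derivative = {sigmoid_prime(a):.4f}")
```

The derivative is largest in the middle of the curve and close to zero at both ends, which becomes important in section 15.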
Why sigmoid?
Three properties matter for the algorithm to work, and sigmoid happens to have all of them:
Without the squashing, a stack of neurons is just a stack of linear combinations — and stacking linear combinations only gives you another linear combination. The non-linearity is what lets the network represent complicated, curvy functions. Modern networks use ReLU or its cousins instead of sigmoid for reasons we'll get to in §15, but the chain-rule machinery is the same.
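The claim that stacked linear layers collapse into one can be checked directly. The sketch below reuses the tutorial's weights but deliberately leaves out the sigmoid; the helper names `matvec` and `matmul` are my own, not part of any library.

```python
# Two "layers" with no squashing: each is just a matrix of weights.
# Rows index destinations, columns index sources.
W_first = [[0.1, 0.8],    # weights into H3
           [0.4, 0.6]]    # weights into H4
W_second = [[0.3, 0.9]]   # weights into O5

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

x = [0.35, 0.9]

stacked = matvec(W_second, matvec(W_first, x))     # layer by layer
collapsed = matvec(matmul(W_second, W_first), x)   # one equivalent layer

print(stacked, collapsed)
```

Both computations give the same number, so without a non-linearity the hidden layer adds nothing a single layer could not do.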
04 The forward pass
We feed $x_1 = 0.35,\ x_2 = 0.9$ through the network and see what comes out. We work left to right, neuron by neuron.
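H₃ first

$$a_3 = W_{13}\,x_1 + W_{23}\,x_2 = (0.1)(0.35) + (0.8)(0.9) = 0.755, \qquad y_3 = \sigma(0.755) \approx 0.6803$$

H₄ next

$$a_4 = W_{14}\,x_1 + W_{24}\,x_2 = (0.4)(0.35) + (0.6)(0.9) = 0.68, \qquad y_4 = \sigma(0.68) \approx 0.6637$$

O₅ last

Now the output neuron's "inputs" are the hidden neurons' outputs:

$$a_5 = W_{35}\,y_3 + W_{45}\,y_4 = (0.3)(0.6803) + (0.9)(0.6637) \approx 0.8014, \qquad y_5 = \sigma(0.8014) \approx 0.6903$$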
The network's prediction is about 0.69. The target was 0.5, so we overshot by about 0.19. The computed activations are $y_3 \approx 0.6803$, $y_4 \approx 0.6637$, and $y_5 \approx 0.6903$.
The backward pass needs the values the forward pass produces — the activation at each neuron, the prediction at the output. Every gradient computation we're about to do refers back to these numbers. Forward then backward is a hard ordering, not a convention.
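The forward pass is short enough to write out neuron by neuron. This is a minimal sketch using the values from the table in section 02; the variable names simply mirror the tutorial's symbols.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Values from the table in section 02.
x1, x2 = 0.35, 0.9
W13, W23, W14, W24, W35, W45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

# Hidden layer: weighted sum, then squash.
a3 = W13 * x1 + W23 * x2
y3 = sigmoid(a3)
a4 = W14 * x1 + W24 * x2
y4 = sigmoid(a4)

# Output neuron: its inputs are the hidden activations.
a5 = W35 * y3 + W45 * y4
y5 = sigmoid(a5)

print(f"a3 = {a3:.4f}, y3 = {y3:.4f}")
print(f"a4 = {a4:.4f}, y4 = {y4:.4f}")
print(f"a5 = {a5:.4f}, y5 = {y5:.4f}")
```

Every value computed here (`y3`, `y4`, `y5`) is reused by the backward pass, which is why the ordering is fixed.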
05 Measuring wrongness
"Wrong by −0.19" is a number, but to do calculus we need a smooth function of the network's prediction that we can minimize. The conventional choice is squared error:
$$E = \tfrac{1}{2}(t - y)^2$$

Here $y$ is the network's prediction ($y_5$ in our network). For our example, $E = \tfrac{1}{2}(-0.19)^2 \approx 0.018$. Three questions about this choice deserve quick answers.
So now we have one number, $E$, that says how wrong the network is. We have six weights we can adjust. If we knew, for each weight, how much $E$ changes when that weight changes — we'd know how to nudge each weight to reduce $E$. That "how much $E$ changes" is exactly the partial derivative $\partial E / \partial W$. Our entire job is to compute six of these.
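Before doing any calculus, the six partial derivatives can be estimated by brute force: nudge one weight, rerun the forward pass, and see how $E$ moves. The sketch below does that with central differences. The function names and the step size `h` are my own choices, and this is only a sanity check, not how backpropagation computes gradients.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def error(weights, x=(0.35, 0.9), t=0.5):
    """Squared error E = 0.5 * (t - y5)^2 for the tutorial's 2-2-1 network."""
    W13, W23, W14, W24, W35, W45 = weights
    y3 = sigmoid(W13 * x[0] + W23 * x[1])
    y4 = sigmoid(W14 * x[0] + W24 * x[1])
    y5 = sigmoid(W35 * y3 + W45 * y4)
    return 0.5 * (t - y5) ** 2

names = ["W13", "W23", "W14", "W24", "W35", "W45"]
weights = [0.1, 0.8, 0.4, 0.6, 0.3, 0.9]

def numerical_gradient(weights, h=1e-6):
    """Central-difference estimate of dE/dW for each weight in turn."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += h
        down = list(weights)
        down[i] -= h
        grads.append((error(up) - error(down)) / (2 * h))
    return grads

print(f"E = {error(weights):.4f}")
for name, g in zip(names, numerical_gradient(weights)):
    print(f"dE/d{name} is roughly {g:+.5f}")
```

This costs two forward passes per weight, which is fine for six weights and hopeless for millions. Backpropagation gets the same six numbers from a single backward sweep.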
06 The chain rule, by itself
Before we apply the chain rule to neural networks, let's remember what it says about a regular function. Suppose:

$$y = f(g(h(x)))$$
where $f$, $g$, $h$ are ordinary functions. As $x$ changes by a tiny amount $\Delta x$:
- $h(x)$ changes by approximately $h'(x)\Delta x$.
- $g(h(x))$ changes by approximately $g'(h(x)) \cdot h'(x)\Delta x$.
- $f(g(h(x)))$ changes by approximately $f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)\Delta x$.
So the derivative is the product of the local derivatives at each stage:

$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

Concrete example. Let $y = (3x + 1)^2$. We could expand: $y = 9x^2 + 6x + 1$, so $dy/dx = 18x + 6$. Or we could chain it. Let $h(x) = 3x + 1$ and $f(u) = u^2$. Then $h'(x) = 3$ and $f'(u) = 2u$, so:

$$\frac{dy}{dx} = f'(h(x)) \cdot h'(x) = 2(3x + 1)\cdot 3 = 18x + 6$$
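That's the entire mathematical content of backpropagation. The neural network is a composition of functions:

$$E = \tfrac{1}{2}\Bigl(t - \sigma\bigl(W_{35}\,\sigma(W_{13}x_1 + W_{23}x_2) + W_{45}\,\sigma(W_{14}x_1 + W_{24}x_2)\bigr)\Bigr)^2$$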
To find $\partial E / \partial W$ for some weight $W$ deep in the network, we just walk from $E$ back to $W$, multiplying local derivatives as we go. The reason this gets a special name and a special algorithm is that done naively you'd repeat a lot of work across the six weights. Backprop is just the chain rule, organized so we don't repeat work.
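The $(3x+1)^2$ example from this section can be checked three ways: with the chain rule, with the expanded polynomial, and numerically. This is a small sketch; the test point `x = 2.0` and the step `dx` are arbitrary choices.

```python
def h(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def dy_dx_chain(x):
    """Chain rule: f'(h(x)) * h'(x), with f'(u) = 2u and h'(x) = 3."""
    return 2 * h(x) * 3

def dy_dx_expanded(x):
    """Derivative of the expanded polynomial 9x^2 + 6x + 1."""
    return 18 * x + 6

def dy_dx_numeric(x, dx=1e-6):
    """Central-difference estimate, using no calculus at all."""
    return (f(h(x + dx)) - f(h(x - dx))) / (2 * dx)

x = 2.0  # arbitrary test point
print(dy_dx_chain(x), dy_dx_expanded(x), dy_dx_numeric(x))
```

All three agree (up to floating-point noise in the numeric estimate), which is the same agreement we rely on when multiplying local derivatives through a network.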
07 Gradient descent
Once we know $\partial E / \partial W$ for a weight, what do we do with it? Imagine plotting $E$ as a function of just that one weight, holding everything else fixed. The curve will generally look like a bowl. We're somewhere on the bowl, and we want to be at the bottom.
The rule is one line of math:

$$W \leftarrow W - \eta\,\frac{\partial E}{\partial W}$$
The minus sign moves us opposite the slope (downhill). The factor $\eta$ — the learning rate — controls step size. Too small and we crawl; too large and we overshoot.
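To see the effect of the learning rate, put a ball on the curve $E(W) = \tfrac{1}{2}(W-3)^2$, starting at $W = 8$. The slope there is $W - 3$, so one gradient-descent update moves the ball to $W - \eta\,(W - 3)$. With $\eta = 1$ a single step lands exactly on the minimum at $W = 3$; smaller values creep toward it, and values above 2 make each step overshoot further than the last.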
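The behaviour on the curve $E(W) = \tfrac{1}{2}(W-3)^2$ is easy to reproduce in a few lines. This is a sketch; the function name `descend`, the five learning rates, and the five-step horizon are illustrative choices of mine.

```python
def descend(eta, w=8.0, steps=5):
    """Run gradient descent on E(W) = 0.5 * (W - 3)^2 and return the path of W."""
    path = [w]
    for _ in range(steps):
        grad = w - 3.0        # dE/dW for this particular curve
        w = w - eta * grad    # step opposite the slope
        path.append(w)
    return path

# Learning rates picked for illustration.
for eta in (0.1, 0.5, 1.0, 1.9, 2.1):
    path = descend(eta)
    print(f"eta = {eta}: " + ", ".join(f"{w:.3f}" for w in path))
```

On this curve each update multiplies the distance to the minimum by $(1 - \eta)$, so $\eta = 0.1$ crawls, $\eta = 1.9$ bounces from side to side while still closing in, and $\eta = 2.1$ moves further away on every step. The thresholds 1 and 2 are specific to this quadratic.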
Many ML tutorials (including Huddar's video) define a symbol $\delta$ that absorbs the minus sign, so the final update rule reads $W \leftarrow W + \eta\, \delta\, o$ with a plus. Mathematically identical, just bookkeeping. We'll see exactly where the sign migrates in §9.
08 The gradient of one output weight
Let's start with the easiest weight: $W_{45}$, which connects $H_4$ to $O_5$. We want $\partial E / \partial W_{45}$.
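$W_{45}$ doesn't affect $E$ directly. It affects $E$ through a chain, $W_{45} \to a_5 \to y_5 \to E$, so the chain rule gives three factors:

$$\frac{\partial E}{\partial W_{45}} = \underbrace{\frac{\partial E}{\partial y_5}}_{(1)} \cdot \underbrace{\frac{\partial y_5}{\partial a_5}}_{(2)} \cdot \underbrace{\frac{\partial a_5}{\partial W_{45}}}_{(3)}$$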
Let's compute each piece. We're just doing calculus on the formulas we already wrote — there is no neural-network magic here.
Piece (1): ∂E/∂y₅
$E = \tfrac{1}{2}(t - y_5)^2$. Differentiating with respect to $y_5$:

$$\frac{\partial E}{\partial y_5} = \tfrac{1}{2}\cdot 2\,(t - y_5)\cdot(-1) = y_5 - t$$
The $-1$ is from differentiating $-y_5$ inside the parentheses.
Piece (2): ∂y₅/∂a₅ — the sigmoid derivative
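The classical result, derivable in one line with the quotient rule:

$$\frac{\partial y_5}{\partial a_5} = \sigma'(a_5) = \sigma(a_5)\bigl(1 - \sigma(a_5)\bigr) = y_5(1 - y_5)$$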
The output times one minus the output. Since we already have $y_5$ from the forward pass, we don't need to evaluate $\sigma$ again.
Piece (3): ∂a₅/∂W₄₅
$a_5 = W_{35}\, y_3 + W_{45}\, y_4$. Differentiate with respect to $W_{45}$ and only the second term survives, with $y_4$ as its coefficient:

$$\frac{\partial a_5}{\partial W_{45}} = y_4$$
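Putting it together

$$\frac{\partial E}{\partial W_{45}} = (y_5 - t)\cdot y_5(1 - y_5)\cdot y_4$$

With our numbers:

$$\frac{\partial E}{\partial W_{45}} = (0.6903 - 0.5)(0.6903)(0.3097)(0.6637) \approx 0.0270$$

Positive, so increasing $W_{45}$ makes things worse. We should decrease it:

$$W_{45} \leftarrow W_{45} - \eta\,\frac{\partial E}{\partial W_{45}} = 0.9 - (1.0)(0.0270) = 0.8730$$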
Which matches Huddar's video to three decimals. ✓
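The three pieces translate one-to-one into code. The sketch below recomputes the forward pass and then multiplies the local derivatives along the chain from $W_{45}$ to $E$; variable names such as `dE_dy5` are mine.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Forward pass with the values from section 02.
x1, x2, t, eta = 0.35, 0.9, 0.5, 1.0
W13, W23, W14, W24, W35, W45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9
y3 = sigmoid(W13 * x1 + W23 * x2)
y4 = sigmoid(W14 * x1 + W24 * x2)
y5 = sigmoid(W35 * y3 + W45 * y4)

# The three local derivatives along the chain W45 -> a5 -> y5 -> E.
dE_dy5 = y5 - t              # piece (1)
dy5_da5 = y5 * (1.0 - y5)    # piece (2)
da5_dW45 = y4                # piece (3)

dE_dW45 = dE_dy5 * dy5_da5 * da5_dW45
W45_new = W45 - eta * dE_dW45

print(f"dE/dW45 = {dE_dW45:.4f}")
print(f"W45: {W45} -> {W45_new:.4f}")
```

Note that piece (2) uses `y5` directly rather than calling `sigmoid` again, exactly as the text describes.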
09 Why δ exists
Look back at the formula we just derived:

$$\frac{\partial E}{\partial W_{45}} = (y_5 - t)\cdot y_5(1 - y_5)\cdot y_4$$

The first two factors, $(y_5 - t)\cdot y_5(1-y_5)$, have nothing to do with which output weight we're computing. They'd be exactly the same if we were computing $\partial E / \partial W_{35}$; only the last factor (which input we're multiplying through) changes. It would be wasteful to recompute them. So we give them a name:

$$\delta_5 = y_5(1 - y_5)(t - y_5)$$
Notice we wrote $(t - y_5)$ instead of $(y_5 - t)$ — that flips the sign of $\delta$ relative to the "true" gradient. It's a convention chosen so the final update rule comes out as a plus sign. (Remember the heads-up at the end of §7.)
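With $\delta_5$ defined, both gradients into the output layer have the same shape:

$$\frac{\partial E}{\partial W_{35}} = -\delta_5\, y_3, \qquad \frac{\partial E}{\partial W_{45}} = -\delta_5\, y_4$$

And the update rule, absorbing the minus sign into the convention:

$$W_{j5} \leftarrow W_{j5} + \eta\,\delta_5\,o_j$$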
where $o_j$ is the output of the upstream neuron. Same formula for every output-layer weight. The neuron-specific work ($\delta_5$) gets computed once and reused for both weights.
What δ really means
$\delta_j$ is, up to a sign, "how much $E$ would change if neuron $j$'s pre-activation $a_j$ changed by a tiny amount." Equivalently: how much responsibility for the error sits at this neuron.
For our network:

$$\delta_5 = (0.6903)(1 - 0.6903)(0.5 - 0.6903) \approx -0.0407$$
Negative because we overshot — the output is too high. Whichever weights feed into $O_5$ should all move in a direction that decreases $a_5$, and indeed the formula $\Delta W = \eta\, \delta_5\, o$ delivers negative updates when $\delta_5 < 0$ and $o > 0$.
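The reuse of $\delta_5$ looks like this in code. To keep the sketch short it starts from the rounded activations of section 04 instead of rerunning the forward pass, so the last digits may differ slightly from a full-precision run.

```python
# Rounded activations from the forward pass in section 04.
y3, y4, y5 = 0.6803, 0.6637, 0.6903
t, eta = 0.5, 1.0

# Neuron-specific work, computed once.
delta5 = y5 * (1.0 - y5) * (t - y5)

# Reused for every weight that feeds O5: only the upstream output changes.
dW35 = eta * delta5 * y3
dW45 = eta * delta5 * y4

print(f"delta5 = {delta5:.4f}")
print(f"dW35 = {dW35:.4f}, dW45 = {dW45:.4f}")
```

Both updates come out negative, as expected when $\delta_5 < 0$ and the upstream outputs are positive.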
11 What if a neuron feeds many outputs?
In our example, $H_3$ feeds only $O_5$. But suppose there were two output neurons, $O_5$ and $O_6$. Then nudging $y_3$ would affect both outputs, and through them both contributions to the error.
The chain rule has a rule for this: when a variable affects the loss through multiple paths, sum the contributions across the paths.
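So the general formula for a hidden neuron's $\delta$ is:

$$\delta_j = y_j(1 - y_j)\sum_k W_{jk}\,\delta_k$$

where $k$ runs over the units that neuron $j$ feeds into.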
That sum is the only structural difference between a hidden neuron with one downstream unit and one with many. With a single downstream unit, the sum collapses to a single term. The Huddar example has exactly one output, so the sum looks trivial — but the same machinery generalizes to networks of arbitrary depth and width.
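Here is a small sketch of the hidden-neuron formula with one and with two downstream units. The second output $O_6$, its weight `0.5`, and its delta `0.02` are hypothetical values invented for the illustration; they do not come from the tutorial's network. The function name `hidden_delta` is also mine.

```python
def hidden_delta(y_j, downstream):
    """delta_j = y_j * (1 - y_j) * sum_k W_jk * delta_k.

    `downstream` is a list of (W_jk, delta_k) pairs, one per unit that
    neuron j feeds into.
    """
    return y_j * (1.0 - y_j) * sum(w * d for w, d in downstream)

y3 = 0.6803  # activation of H3 from the forward pass (rounded)

# One downstream unit, as in the tutorial's network: the sum has one term.
single = hidden_delta(y3, [(0.3, -0.0407)])

# Two downstream units. O6, its weight 0.5, and its delta 0.02 are
# hypothetical values invented for this illustration.
double = hidden_delta(y3, [(0.3, -0.0407), (0.5, 0.02)])

print(f"one output:  delta3 = {single:.5f}")
print(f"two outputs: delta3 = {double:.5f}")
```

In the two-output case the contributions partly cancel, because the hypothetical $O_6$ wants $y_3$ to go up while $O_5$ wants it to go down.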
12 The complete algorithm
Everything we've derived, compressed:
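Forward pass

For each neuron, in order from input to output:

$$a_j = \sum_i W_{ij}\, o_i, \qquad y_j = \sigma(a_j)$$

Backward pass: output neurons

$$\delta_k = y_k(1 - y_k)(t_k - y_k)$$

Backward pass: hidden neurons

$$\delta_j = y_j(1 - y_j)\sum_k W_{jk}\,\delta_k$$

Weight update

$$W_{ij} \leftarrow W_{ij} + \eta\,\delta_j\,o_i$$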
Four formulas. The by-hand exercises in the companion file are nothing but careful bookkeeping of these four pieces. The same four formulas, with $\sigma$ replaced by some other activation function and $\frac{1}{2}(t-y)^2$ replaced by some other loss, give you the algorithm used to train essentially every neural network in use today.
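The four formulas fit in one short function. What follows is a sketch under my own assumptions: the name `train_step`, the nested-list layout `layers[l][j][i]` (weight from unit `i` of the previous layer into unit `j` of layer `l`), and in-place updates are illustrative choices, not the companion file's code.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(layers, x, targets, eta):
    """One forward pass, backward pass, and weight update (sigmoid units,
    squared error). Updates `layers` in place and returns the prediction
    made before the update."""
    # Forward pass: keep every layer's outputs for the backward pass.
    outputs = [list(x)]
    for W in layers:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * o for w, o in zip(row, prev))) for row in W])

    # Backward pass, output neurons: delta = y (1 - y) (t - y).
    deltas = [y * (1 - y) * (t - y) for y, t in zip(outputs[-1], targets)]
    all_deltas = [deltas]

    # Backward pass, hidden neurons: delta_j = y_j (1 - y_j) sum_k W_jk delta_k.
    for l in range(len(layers) - 1, 0, -1):
        W_next, ys = layers[l], outputs[l]
        deltas = [
            ys[j] * (1 - ys[j]) * sum(W_next[k][j] * deltas[k] for k in range(len(W_next)))
            for j in range(len(ys))
        ]
        all_deltas.insert(0, deltas)

    # Weight update: W_ij <- W_ij + eta * delta_j * o_i.
    for W, d, o in zip(layers, all_deltas, outputs):
        for j, row in enumerate(W):
            for i in range(len(row)):
                row[i] += eta * d[j] * o[i]

    return outputs[-1]

# The tutorial's network: rows are destinations, columns are sources.
layers = [[[0.1, 0.8], [0.4, 0.6]],   # into H3, H4
          [[0.3, 0.9]]]               # into O5
prediction = train_step(layers, [0.35, 0.9], [0.5], eta=1.0)
print(prediction, layers)
```

All deltas are computed before any weight is changed, so the hidden deltas use the original $W_{35}$ and $W_{45}$. Updating the output weights first and then backpropagating through the new values is a common by-hand mistake.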
13 Exercise 1, worked end-to-end
Let's redo the example from scratch, doing nothing new — but showing every intermediate number. Read this while writing it out yourself on paper; you'll learn ten times as much from doing the arithmetic as from reading mine.
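Step 1: Forward pass

$$a_3 = (0.1)(0.35) + (0.8)(0.9) = 0.755, \qquad y_3 = \sigma(0.755) \approx 0.6803$$

$$a_4 = (0.4)(0.35) + (0.6)(0.9) = 0.68, \qquad y_4 = \sigma(0.68) \approx 0.6637$$

$$a_5 = (0.3)(0.6803) + (0.9)(0.6637) \approx 0.8014, \qquad y_5 = \sigma(0.8014) \approx 0.6903$$

Step 2: δ at the output

$$\delta_5 = y_5(1 - y_5)(t - y_5) = (0.6903)(0.3097)(0.5 - 0.6903) \approx -0.0407$$

Negative because we overshot the target.

Step 3: δ at the hidden neurons

$$\delta_3 = y_3(1 - y_3)\,W_{35}\,\delta_5 = (0.6803)(0.3197)(0.3)(-0.0407) \approx -0.00266$$

$$\delta_4 = y_4(1 - y_4)\,W_{45}\,\delta_5 = (0.6637)(0.3363)(0.9)(-0.0407) \approx -0.00817$$

(The last digit can shift by one depending on how intermediate values are rounded.)

Step 4: All six weight updates

Apply $\Delta W_{ij} = \eta \cdot \delta_j \cdot o_i$, matching the destination's $\delta$ with the source's output: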
| Weight | δ (destination) | o (source) | ΔW | New value |
|---|---|---|---|---|
| W₁₃ | −0.00266 | 0.35 | −0.000931 | 0.0991 |
| W₂₃ | −0.00266 | 0.9 | −0.00239 | 0.7976 |
| W₁₄ | −0.00817 | 0.35 | −0.00286 | 0.3971 |
| W₂₄ | −0.00817 | 0.9 | −0.00735 | 0.5926 |
| W₃₅ | −0.0407 | 0.6803 | −0.0277 | 0.2723 |
| W₄₅ | −0.0407 | 0.6637 | −0.0270 | 0.8730 |
All six weights decreased — consistent with $\delta_5 < 0$ pushing the network to lower its output. The numbers match the video exactly.
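Step 5: Confirm by re-running the forward pass

$$y_3 = \sigma\bigl((0.0991)(0.35) + (0.7976)(0.9)\bigr) = \sigma(0.7525) \approx 0.6797$$

$$y_4 = \sigma\bigl((0.3971)(0.35) + (0.5926)(0.9)\bigr) = \sigma(0.6723) \approx 0.6620$$

$$y_5 = \sigma\bigl((0.2723)(0.6797) + (0.8730)(0.6620)\bigr) = \sigma(0.7630) \approx 0.6820$$

New error: $t - y_5 = 0.5 - 0.682 = -0.182$, down in magnitude from $-0.190$. The error shrank after a single step. Repeat this many times and the error approaches zero.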
that's the algorithm — everything from here is elaboration
14 Watching it train
One iteration shrinks the error by about 4%. So 100 iterations? 1000? Here's what happens if we just keep applying the same procedure over and over to this single example:
What this plot is actually showing: at each iteration we run a forward pass to get a prediction, compute the loss, run a backward pass to get deltas, update all six weights, and repeat. The same arithmetic from §13, just done several thousand times in a fraction of a second. The y-axis is the loss; the x-axis is the iteration count.
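The loop described above can be sketched in plain Python. The iteration count of 1000 and the checkpoints printed at the end are my own choices; everything else reuses the numbers from section 02.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x1, x2, t, eta = 0.35, 0.9, 0.5, 1.0
W13, W23, W14, W24, W35, W45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

losses = []
for step in range(1000):   # iteration count chosen for illustration
    # Forward pass.
    y3 = sigmoid(W13 * x1 + W23 * x2)
    y4 = sigmoid(W14 * x1 + W24 * x2)
    y5 = sigmoid(W35 * y3 + W45 * y4)
    losses.append(0.5 * (t - y5) ** 2)

    # Backward pass: deltas use the weights from before this update.
    d5 = y5 * (1 - y5) * (t - y5)
    d3 = y3 * (1 - y3) * W35 * d5
    d4 = y4 * (1 - y4) * W45 * d5

    # Update all six weights.
    W13 += eta * d3 * x1
    W23 += eta * d3 * x2
    W14 += eta * d4 * x1
    W24 += eta * d4 * x2
    W35 += eta * d5 * y3
    W45 += eta * d5 * y4

for step in (0, 1, 10, 100, 999):
    print(f"iteration {step:4d}: loss = {losses[step]:.3e}")
```

Plotting `losses` against the iteration index gives the loss curve. With a single training example there is nothing to stop the network from fitting it essentially perfectly.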
In a real training run, you wouldn't have one input and one target; you'd have a dataset. The loss is then averaged across the dataset (or a mini-batch from it), and a single gradient-descent step uses the average gradient. The per-example mechanics are exactly the ones we just walked through; you just average the per-example gradients before applying the update.
15 The vanishing-gradient problem
Look at the sigmoid derivative $\sigma'(a) = y(1-y)$ as a function of the output $y$. It's a small bump centered at $y = 0.5$, where it peaks at $0.25$. As $y$ moves toward 0 or 1, $\sigma'$ shrinks toward zero quickly: at $y = 0.9$ it is $0.09$, and at $y = 0.99$ it is about $0.0099$.
Now remember the hidden-delta formula:

$$\delta_j = y_j(1 - y_j)\sum_k W_{jk}\,\delta_k$$
Every time the gradient passes back through a sigmoid, it gets multiplied by $y(1-y)$, which is at most 0.25 and often much smaller. In a network with many layers, the gradient flowing back to the early layers is the product of many such small factors (together with the weights along the path), so it tends to shrink geometrically with depth. Even in the best case of 0.25 per layer, ten sigmoid layers contribute a factor of $0.25^{10} \approx 10^{-6}$, and saturated neurons make it far smaller. The early-layer weights then receive vanishingly small updates and effectively stop learning.
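The geometric shrinkage is easy to quantify. The sketch below looks only at the sigmoid factor and treats every weight along the path as 1, which is a simplifying assumption; real networks also multiply in the weights, which can make things better or worse.

```python
def sigmoid_prime_from_output(y):
    """sigma'(a) expressed through the neuron's output y = sigma(a)."""
    return y * (1.0 - y)

# Best case: every neuron sits at y = 0.5, where the derivative peaks at 0.25.
# Weight factors are ignored here (treated as 1) to isolate the sigmoid's effect.
best_case = sigmoid_prime_from_output(0.5)

for depth in (1, 2, 5, 10, 20):
    print(f"{depth:2d} layers: gradient scaled by at most {best_case ** depth:.2e}")
```

Since 0.25 is an upper bound on each factor, these numbers are the most optimistic scaling a chain of sigmoids can offer.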
This is the vanishing-gradient problem. It was one of the main reasons deep neural networks didn't work well before about 2010. Two things eventually solved it:
- Better activation functions. ReLU ($\max(0, x)$) has derivative either 0 or 1 — no shrinking in the gradient when the neuron is active. Modern variants (Leaky ReLU, GELU, SiLU) refine this further.
- Better initialization, normalization, and architectures. Skip connections (as in ResNets), layer normalization, careful weight initialization — all designed in part to keep gradients from vanishing or exploding through depth.
The chain-rule machinery we just derived is exactly the same in modern networks. The pieces that change are which activation function lives at each neuron and what other tricks the architecture employs to keep gradients well-behaved. Backprop itself didn't need to be reinvented.
16 Bias terms (the thing we left out)
We've been writing $a_j = \sum_i W_{ij}\, o_i$. A real neuron in a modern network has one more parameter, a bias term $b_j$ that's added regardless of input:

$$a_j = \sum_i W_{ij}\, o_i + b_j$$
The bias gives the neuron the ability to "fire" even when all inputs are zero, and to shift the sigmoid's operating point left or right. Without it, the neuron's output is forced to be $\sigma(0) = 0.5$ when all inputs vanish — not always what you want.
The good news: backprop doesn't change at all. A bias is just a weight whose "upstream output" is the constant 1. Apply the update rule with $o = 1$:

$$b_j \leftarrow b_j + \eta\,\delta_j$$
That's it. Every neuron in a real network has one bias, and biases are trained with the same δ-based updates as the rest of the weights.
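As a sketch of the bias update, here is the output neuron $O_5$ with a bias added. The starting value `b5 = 0.0` is an assumption of mine (the tutorial's network has no biases), chosen so the forward pass matches the earlier numbers.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Output neuron O5 with a bias. The starting bias 0.0 is an assumed value;
# the tutorial's network has no biases.
y3, y4 = 0.6803, 0.6637        # rounded hidden activations from section 04
W35, W45, b5 = 0.3, 0.9, 0.0
t, eta = 0.5, 1.0

a5 = W35 * y3 + W45 * y4 + b5  # the bias is added regardless of input
y5 = sigmoid(a5)
delta5 = y5 * (1 - y5) * (t - y5)

# Same update rule as any weight, with upstream output o = 1.
W35 += eta * delta5 * y3
W45 += eta * delta5 * y4
b5 += eta * delta5 * 1.0

print(f"delta5 = {delta5:.4f}, new b5 = {b5:.4f}")
```

Because its "input" is always 1, the bias moves by exactly $\eta\,\delta_5$, which makes it the fastest-moving parameter of this neuron whenever the upstream outputs are below 1.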
17 A vectorized glimpse
So far we've been writing everything element-by-element with subscripts. Real ML frameworks write the same algorithm in matrix-vector form. For a layer with input $\mathbf{o}^{(l-1)}$ and weights $W^{(l)}$:
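Forward pass

$$\mathbf{a}^{(l)} = W^{(l)}\,\mathbf{o}^{(l-1)}, \qquad \mathbf{o}^{(l)} = \sigma\bigl(\mathbf{a}^{(l)}\bigr)$$

($\sigma$ applied element-wise to the vector. Row $j$ of $W^{(l)}$ holds the weights into unit $j$ of layer $l$.)

Backward pass

At the output layer $L$:

$$\boldsymbol{\delta}^{(L)} = \mathbf{o}^{(L)} \odot \bigl(1 - \mathbf{o}^{(L)}\bigr) \odot \bigl(\mathbf{t} - \mathbf{o}^{(L)}\bigr)$$

At every hidden layer:

$$\boldsymbol{\delta}^{(l)} = \mathbf{o}^{(l)} \odot \bigl(1 - \mathbf{o}^{(l)}\bigr) \odot \bigl((W^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\bigr)$$

($\odot$ is element-wise multiplication. $(W^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}$ is the matrix-vector form of the summation $\sum_k \delta_k W_{jk}$: same arithmetic, different notation.)

Weight update

$$W^{(l)} \leftarrow W^{(l)} + \eta\, \boldsymbol{\delta}^{(l)} \bigl(\mathbf{o}^{(l-1)}\bigr)^\top$$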
That's the same algorithm we derived in §12, written so that a GPU can do it in parallel. Frameworks such as PyTorch and JAX obtain the same gradients through automatic differentiation rather than hand-written formulas. The matrices get bigger and the activations get fancier, but the structure is unchanged from what you just learned.
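To show the matrix form without depending on any framework, the sketch below writes the handful of linear-algebra helpers by hand. In practice an array library would supply them; the helper names (`matvec`, `transpose`, `hadamard`, `outer`, `updated`) are my own, and the row-equals-destination layout is an assumption that matches the transpose in the backward pass.

```python
import math

def sigmoid_vec(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def matvec(M, v):
    """M @ v for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def hadamard(u, v):
    """Element-wise product."""
    return [a * b for a, b in zip(u, v)]

def outer(u, v):
    """Outer product u v^T."""
    return [[a * b for b in v] for a in u]

# Rows index destinations, columns index sources.
W1 = [[0.1, 0.8], [0.4, 0.6]]   # rows: H3, H4
W2 = [[0.3, 0.9]]               # row: O5
x, t, eta = [0.35, 0.9], [0.5], 1.0

# Forward pass.
o1 = sigmoid_vec(matvec(W1, x))
o2 = sigmoid_vec(matvec(W2, o1))

# Backward pass.
sig_prime2 = [y * (1 - y) for y in o2]
sig_prime1 = [y * (1 - y) for y in o1]
delta2 = hadamard(sig_prime2, [ti - yi for ti, yi in zip(t, o2)])
delta1 = hadamard(sig_prime1, matvec(transpose(W2), delta2))

# Weight update: W <- W + eta * delta * o^T.
def updated(W, delta, o_prev):
    step = outer(delta, o_prev)
    return [[w + eta * s for w, s in zip(w_row, s_row)] for w_row, s_row in zip(W, step)]

W2_new = updated(W2, delta2, o1)
W1_new = updated(W1, delta1, x)
print(W1_new, W2_new)
```

The updated matrices hold the same six numbers as the table in section 13, just arranged by layer.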
18 Common confusions, addressed
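With an array laid out as weights[layer][destination][source], weights[0][1][0] is $W_{14}$: layer 0, destination $H_4$, source $x_1$.

Sign conventions vary between sources, and the following are equivalent:

- $\delta = y(1-y)(t - y)$, update $W \leftarrow W + \eta\, \delta\, o$ (the one used here).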
- $\delta = y(1-y)(y - t)$, update $W \leftarrow W - \eta\, \delta\, o$.
- Loss written $\tfrac{1}{2}(y - t)^2$ instead of $\tfrac{1}{2}(t - y)^2$, with either of the above.
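That the two $\delta$ conventions produce the same new weight can be confirmed in a few lines. This sketch uses the rounded values for $W_{45}$ from the worked example; the variable names are mine.

```python
# Rounded values from the worked example: output y5, target, upstream y4, W45.
y, t, o, eta, W = 0.6903, 0.5, 0.6637, 1.0, 0.9

# Convention used here: delta carries (t - y), update adds.
delta_plus = y * (1 - y) * (t - y)
W_a = W + eta * delta_plus * o

# Alternative: delta carries (y - t), update subtracts.
delta_minus = y * (1 - y) * (y - t)
W_b = W - eta * delta_minus * o

print(f"plus convention:  {W_a:.4f}")
print(f"minus convention: {W_b:.4f}")
```

The two deltas differ only in sign, and the sign of the update rule flips to compensate. Mixing one convention's $\delta$ with the other's update rule is the usual source of weights that move uphill.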
19 Where to go from here
If you've made it this far, you understand backpropagation. The fastest way to make that knowledge stick is to do the by-hand exercises in the companion file:
- Exercise 1 is the example you just saw — do it on paper without looking.
- Exercise 2 is the same network shape with different starting numbers and a smaller η.
- Exercise 3 has two output neurons, which means the hidden-δ formula actually has a sum of two terms in it (not the single-term collapse of Exercises 1 and 2). This is the one that locks the algorithm in your head.
After that, four destinations:
Backpropagation looks intimidating from outside and ridiculously simple once you've done it by hand a few times. The hard part isn't the math — the math is just the chain rule. The hard part is keeping track of all the indices and not letting the notation scare you. You've done that now.