A from-first-principles tutorial

Backpropagation, taught slowly

Every formula in the algorithm exists because of the chain rule. Once you see that, the Greek letters and indices stop being scary.

Companion to Huddar's video · ~45 min read · 4 interactive diagrams

01 What we're doing

Backpropagation is the procedure that, given a neural network and a single training example, tells you for each weight in the network which direction to nudge it so the network's prediction becomes a little less wrong. That's the whole show. Everything that follows is the mechanics of how that nudging instruction gets computed.

The trick that makes it work — and the reason it has its name — is that the algorithm propagates information backward from the output (where we know how wrong we are) through the network (figuring out who's responsible for the wrongness). That backward flow is the chain rule from calculus, applied to a function that happens to be a composition of layers.

A promise

If you remember what a derivative is and what the chain rule does, you'll come out of this tutorial understanding why every step in backpropagation exists — not just how to apply the formulas. The math is high-school calculus; the hard part is the indices and notation.

02 The network

Here is the network we'll use throughout. Two inputs, two hidden neurons, one output. Six weights total. No biases (we'll add them later).

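[Figure 1: network diagram. Inputs x₁ = 0.35 and x₂ = 0.9 connect to hidden neurons H₃ and H₄ through W₁₃ = 0.1, W₁₄ = 0.4, W₂₃ = 0.8, W₂₄ = 0.6; H₃ and H₄ connect to the output O₅ (prediction y₅) through W₃₅ = 0.3 and W₄₅ = 0.9.]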
Fig. 1. The architecture. Inputs are fixed by the data; the six weights are what backpropagation will adjust.

| Symbol | Meaning | Value |
|---|---|---|
| x₁, x₂ | inputs (the data) | 0.35, 0.9 |
| W₁₃ | weight from x₁ into H₃ | 0.1 |
| W₂₃ | weight from x₂ into H₃ | 0.8 |
| W₁₄ | weight from x₁ into H₄ | 0.4 |
| W₂₄ | weight from x₂ into H₄ | 0.6 |
| W₃₅ | weight from H₃ into O₅ | 0.3 |
| W₄₅ | weight from H₄ into O₅ | 0.9 |
| t | target (what we want O₅ to predict) | 0.5 |
| η | learning rate (step size) | 1.0 |

Notation

The subscript convention here is source–destination: $W_{13}$ means "from unit 1 to unit 3." Some textbooks reverse this. Mathematically identical, but you have to pick one and stick with it. Whenever you encounter someone else's notation, look at a single concrete example and figure out which direction they meant.

03 Anatomy of a neuron

Each neuron in the network does exactly two things, in order: a weighted sum of its inputs, then a squashing function applied to that sum. That's it.

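[Figure 2: a single neuron. input₁ and input₂ are multiplied by W₁ and W₂, summed (Σ) into the weighted sum a, then passed through the sigmoid σ to give the output.]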
Fig. 2. One neuron. Three operations: multiply each input by its weight, sum, squash.

Step A — weighted sum. Take all inputs, multiply each by its corresponding weight, add them up. We call that sum $a$ (for "pre-activation"):

$$a_j = \sum_{i} W_{ij}\, o_i$$

where $o_i$ is the output of upstream unit $i$. For the first layer, $o_i$ is just the raw input $x_i$.

Step B — squash. Put $a$ through the sigmoid function, which compresses any real number into the open interval $(0, 1)$:

$$\sigma(a) = \frac{1}{1 + e^{-a}}, \qquad o_j = \sigma(a_j)$$

The sigmoid curve is the central visual object of this entire tutorial. Drag the red dot to see how the output and the derivative behave at different inputs:

Fig. 3. Drag the dot. As you move toward the extremes, the curve flattens and $\sigma'(a) = \sigma(a)(1-\sigma(a))$ shrinks toward zero — this single fact will come back to haunt us in §15.

Why sigmoid?

Three properties matter for the algorithm to work, and sigmoid happens to have all of them:

It's bounded.
Outputs are always in $(0, 1)$. Useful when you want neurons to behave like soft "on/off" switches.
It's smooth and differentiable everywhere.
The algorithm we're about to derive needs derivatives at every neuron. Sharp corners would break it.
Its derivative is unusually convenient.
$\sigma'(a) = \sigma(a) \cdot (1 - \sigma(a)) = o \cdot (1 - o)$. The derivative is just the output times one minus the output — so during backpropagation we don't have to re-evaluate the sigmoid; we already have the output from the forward pass.
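
The derivative identity is easy to check numerically. The following is a minimal sketch, not part of the companion files; the helper names `sigmoid_prime_from_output` and `numeric_derivative` are made up for illustration. It compares $o(1-o)$ against a finite-difference estimate of the slope at a few inputs.

```python
import math

def sigmoid(a: float) -> float:
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime_from_output(o: float) -> float:
    """Sigmoid derivative written in terms of the output o = sigmoid(a)."""
    return o * (1.0 - o)

def numeric_derivative(f, a: float, h: float = 1e-6) -> float:
    """Central finite difference, used here only as an independent check."""
    return (f(a + h) - f(a - h)) / (2.0 * h)

for a in (-4.0, 0.0, 0.755, 4.0):
    o = sigmoid(a)
    print(f"a={a:+.3f}  sigmoid={o:.4f}  "
          f"o(1-o)={sigmoid_prime_from_output(o):.4f}  "
          f"numeric={numeric_derivative(sigmoid, a):.4f}")
```

The two derivative columns should agree, and both should be largest at $a = 0$ and small at $a = \pm 4$.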

Without the squashing, a stack of neurons is just a stack of linear combinations — and stacking linear combinations only gives you another linear combination. The non-linearity is what lets the network represent complicated, curvy functions. Modern networks use ReLU or its cousins instead of sigmoid for reasons we'll get to in §15, but the chain-rule machinery is the same.

04 The forward pass

We feed $x_1 = 0.35,\ x_2 = 0.9$ through the network and see what comes out. We work left to right, neuron by neuron.

H₃ first

$$a_3 = W_{13}\, x_1 + W_{23}\, x_2 = (0.1)(0.35) + (0.8)(0.9) = 0.755$$
$$y_3 = \sigma(0.755) \approx 0.6803$$

H₄ next

$$a_4 = (0.4)(0.35) + (0.6)(0.9) = 0.68, \qquad y_4 = \sigma(0.68) \approx 0.6637$$

O₅ last

Now the output neuron's "inputs" are the hidden neurons' outputs:

$$a_5 = (0.3)(0.6803) + (0.9)(0.6637) = 0.8014, \qquad y_5 = \sigma(0.8014) \approx 0.6903$$

The network's prediction is 0.69. The target was 0.5. We overshot by 0.19. Here's the same diagram with the computed activations placed inside each neuron:

[Figure 4: the network after the forward pass, with data flowing left to right. x₁ = 0.35, x₂ = 0.9; activations H₃ = 0.680, H₄ = 0.664, O₅ = 0.690 (the prediction y₅).]
Fig. 4. Post-forward-pass state. The network outputs 0.69; the target was 0.5; the error is −0.19.
Why forward first?

The backward pass needs the values the forward pass produces — the activation at each neuron, the prediction at the output. Every gradient computation we're about to do refers back to these numbers. Forward then backward is a hard ordering, not a convention.
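
The forward pass above is short enough to write out directly. This is a sketch using the weights and inputs from §02; the function name `forward` and the choice to return all three activations are illustrative assumptions, made because the backward pass will need every one of them.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Inputs and weights from the table in section 02.
x1, x2 = 0.35, 0.9
W13, W23, W14, W24, W35, W45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

def forward(x1, x2):
    """Left to right: H3, then H4, then O5. Returns every activation."""
    a3 = W13 * x1 + W23 * x2
    y3 = sigmoid(a3)
    a4 = W14 * x1 + W24 * x2
    y4 = sigmoid(a4)
    a5 = W35 * y3 + W45 * y4   # the output neuron's inputs are y3 and y4
    y5 = sigmoid(a5)
    return y3, y4, y5

y3, y4, y5 = forward(x1, x2)
print(f"y3={y3:.4f}  y4={y4:.4f}  y5={y5:.4f}")
```

Run as written, the three values should round to the 0.6803, 0.6637 and 0.6903 computed by hand.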

05 Measuring wrongness

"Wrong by −0.19" is a number, but to do calculus we need a smooth function of the network's prediction that we can minimize. The conventional choice is squared error:

$$E = \tfrac{1}{2}(t - y_5)^2$$

For our example, $E = \tfrac{1}{2}(-0.19)^2 \approx 0.018$. Three questions about this choice deserve quick answers.

Why squared?
Because we want positive and negative errors to count equally as "bad," and we want the function to be differentiable everywhere. Absolute value $|t - y|$ also gives equal treatment to signs, but has a non-differentiable corner at zero. Squared error has no such corner.
Why the ½ in front?
Pure bookkeeping. The factor of 2 from differentiating the square cancels with the ½, leaving a clean $-(t - y)$ instead of $-2(t - y)$. It has no semantic meaning; it just makes the algebra one symbol cleaner.
Why this loss instead of some other?
For regression (predicting a number), squared error is the standard. For classification (predicting a category), the standard is cross-entropy. The whole backprop framework still applies — only the formula for $\partial E / \partial y$ at the output changes.

So now we have one number, $E$, that says how wrong the network is. We have six weights we can adjust. If we knew, for each weight, how much $E$ changes when that weight changes — we'd know how to nudge each weight to reduce $E$. That "how much $E$ changes" is exactly the partial derivative $\partial E / \partial W$. Our entire job is to compute six of these.
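
Before deriving anything, it can help to see what these six numbers are by brute force. The sketch below is not backpropagation: it nudges each weight up and down by a tiny amount and measures how $E$ responds. The dictionary layout and the function names `loss` and `numeric_gradient` are assumptions made for this illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def loss(w, x=(0.35, 0.9), t=0.5):
    """Squared-error loss of the 2-2-1 network for a dict of six weights."""
    y3 = sigmoid(w["W13"] * x[0] + w["W23"] * x[1])
    y4 = sigmoid(w["W14"] * x[0] + w["W24"] * x[1])
    y5 = sigmoid(w["W35"] * y3 + w["W45"] * y4)
    return 0.5 * (t - y5) ** 2

weights = {"W13": 0.1, "W23": 0.8, "W14": 0.4, "W24": 0.6, "W35": 0.3, "W45": 0.9}

def numeric_gradient(w, h=1e-6):
    """Estimate dE/dW for each weight with a central finite difference."""
    grads = {}
    for name in w:
        up = dict(w)
        up[name] += h
        down = dict(w)
        down[name] -= h
        grads[name] = (loss(up) - loss(down)) / (2.0 * h)
    return grads

print(f"E = {loss(weights):.4f}")
for name, g in numeric_gradient(weights).items():
    print(f"dE/d{name} ~ {g:+.6f}")
```

This costs two full forward passes per weight, which is fine for six weights and hopeless for millions. Backpropagation gets the same numbers from one forward and one backward pass.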

06 The chain rule, by itself

Before we apply the chain rule to neural networks, let's remember what it says about a regular function. Suppose:

$$y = f(g(h(x)))$$

where $f$, $g$, $h$ are ordinary functions. As $x$ changes by a tiny amount $\Delta x$:

  • $h(x)$ changes by approximately $h'(x)\Delta x$.
  • $g(h(x))$ changes by approximately $g'(h(x)) \cdot h'(x)\Delta x$.
  • $f(g(h(x)))$ changes by approximately $f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)\Delta x$.

So the derivative is the product of the local derivatives at each stage:

$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

Concrete example. Let $y = (3x + 1)^2$. We could expand: $y = 9x^2 + 6x + 1$, so $dy/dx = 18x + 6$. Or we could chain it. Let $h(x) = 3x + 1$ and $f(u) = u^2$. Then $h'(x) = 3$ and $f'(u) = 2u$, so:

$$\frac{dy}{dx} = f'(h(x)) \cdot h'(x) = 2(3x+1) \cdot 3 = 6(3x + 1) = 18x + 6 \quad\checkmark$$

That's the entire mathematical content of backpropagation. The neural network is a composition of functions:

$$E \;=\; \tfrac{1}{2}(t - y_5)^2,\quad y_5 = \sigma(a_5),\quad a_5 = W_{35}y_3 + W_{45}y_4,\quad y_3 = \sigma(a_3),\quad \dots$$

To find $\partial E / \partial W$ for some weight $W$ deep in the network, we just walk from $E$ back to $W$, multiplying local derivatives as we go. The reason this gets a special name and a special algorithm is that done naively you'd repeat a lot of work across the six weights. Backprop is just the chain rule, organized so we don't repeat work.
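
The $(3x+1)^2$ example can be checked in a few lines. This sketch compares the chain-rule product, the derivative of the expanded polynomial, and a finite-difference estimate; the function names are illustrative.

```python
def h(x):
    """Inner function h(x) = 3x + 1."""
    return 3.0 * x + 1.0

def f(u):
    """Outer function f(u) = u^2."""
    return u ** 2

def dy_dx_chain(x):
    """Chain rule: f'(h(x)) * h'(x), with f'(u) = 2u and h'(x) = 3."""
    return 2.0 * h(x) * 3.0

def dy_dx_expanded(x):
    """Derivative of the expanded polynomial 9x^2 + 6x + 1."""
    return 18.0 * x + 6.0

def dy_dx_numeric(x, eps=1e-6):
    """Central finite difference of the composed function."""
    return (f(h(x + eps)) - f(h(x - eps))) / (2.0 * eps)

for x in (-1.0, 0.0, 2.0):
    print(x, dy_dx_chain(x), dy_dx_expanded(x), round(dy_dx_numeric(x), 6))
```

All three columns should agree at every $x$, which is the same kind of check we can run later against the network's gradients.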

07 Gradient descent

Once we know $\partial E / \partial W$ for a weight, what do we do with it? Imagine plotting $E$ as a function of just that one weight, holding everything else fixed. Near a minimum, the curve looks like a bowl. We're somewhere on the side of the bowl, and we want to be at the bottom.

The rule is one line of math:

$$W_{\text{new}} = W_{\text{old}} - \eta \cdot \frac{\partial E}{\partial W}$$

The minus sign moves us opposite the slope (downhill). The factor $\eta$ — the learning rate — controls step size. Too small and we crawl; too large and we overshoot.

Drag the slider to change the learning rate and watch what happens. The ball is sitting on a curve $E(W) = \tfrac{1}{2}(W-3)^2$, starting at $W = 8$. Pressing Step performs one gradient-descent update:

[Interactive plot: the ball starts at W = 8.00 with E = 12.50.]
Fig. 5. Slide η through different values. At η = 1.0 a single step lands exactly on the minimum. For η between 1 and 2 the ball overshoots but still converges; above η = 2 it diverges, overshooting more with each step. The same failure mode shows up in neural network training.
Heads-up: a sign convention

Many ML tutorials (including Huddar's video) define a symbol $\delta$ that absorbs the minus sign, so the final update rule reads $W \leftarrow W + \eta\, \delta\, o$ with a plus. Mathematically identical, just bookkeeping. We'll see exactly where the sign migrates in §9.
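
The update rule from this section, applied to the toy curve $E(W) = \tfrac{1}{2}(W-3)^2$, fits in a few lines. This is a sketch; the function names `grad` and `descend` and the particular learning rates tried are illustrative choices.

```python
def grad(W):
    """dE/dW for E(W) = 0.5 * (W - 3)^2."""
    return W - 3.0

def descend(eta, W=8.0, steps=5):
    """Apply W <- W - eta * dE/dW repeatedly; return every W visited."""
    path = [W]
    for _ in range(steps):
        W = W - eta * grad(W)
        path.append(W)
    return path

for eta in (0.1, 1.0, 1.9, 2.5):
    print(eta, [round(W, 3) for W in descend(eta)])
```

On this curve each step multiplies the distance to the minimum by $(1 - \eta)$, so $\eta = 0.1$ crawls, $\eta = 1$ lands in one step, $\eta = 1.9$ bounces from side to side while closing in, and $\eta = 2.5$ moves farther away every step.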

08 The gradient of one output weight

Let's start with the easiest weight: $W_{45}$, which connects $H_4$ to $O_5$. We want $\partial E / \partial W_{45}$.

$W_{45}$ doesn't affect $E$ directly. It affects $E$ through a chain:

[Figure 6: the chain W₄₅ → a₅ → y₅ → E, with a₅ = W₃₅·y₃ + W₄₅·y₄, y₅ = σ(a₅), E = ½(t − y₅)². Local derivatives: ∂a₅/∂W₄₅ = y₄, ∂y₅/∂a₅ = y₅(1−y₅), ∂E/∂y₅ = −(t−y₅). ∂E/∂W₄₅ is the product of these three partials.]
Fig. 6. The chain from $W_{45}$ to $E$ has three links, passing through the intermediate nodes $a_5$ and $y_5$. Multiply the three local derivatives, and you have the gradient.
$$\frac{\partial E}{\partial W_{45}} = \underbrace{\frac{\partial E}{\partial y_5}}_{(1)} \cdot \underbrace{\frac{\partial y_5}{\partial a_5}}_{(2)} \cdot \underbrace{\frac{\partial a_5}{\partial W_{45}}}_{(3)}$$

Let's compute each piece. We're just doing calculus on the formulas we already wrote — there is no neural-network magic here.

Piece (1): ∂E/∂y₅

$E = \tfrac{1}{2}(t - y_5)^2$. Differentiating with respect to $y_5$:

$$\frac{\partial E}{\partial y_5} = \tfrac{1}{2} \cdot 2(t - y_5) \cdot (-1) = -(t - y_5) = y_5 - t$$

The $-1$ is from differentiating $-y_5$ inside the parentheses.

Piece (2): ∂y₅/∂a₅ — the sigmoid derivative

The classical result, derivable in one line with the quotient rule:

$$\sigma'(a) = \sigma(a) \cdot (1 - \sigma(a)) = y \cdot (1 - y)$$

The output times one minus the output. Since we already have $y_5$ from the forward pass, we don't need to evaluate $\sigma$ again.

Piece (3): ∂a₅/∂W₄₅

$a_5 = W_{35}\, y_3 + W_{45}\, y_4$. Differentiate with respect to $W_{45}$ and only the second term survives, with $y_4$ as its coefficient:

$$\frac{\partial a_5}{\partial W_{45}} = y_4$$

Putting it together

$$\frac{\partial E}{\partial W_{45}} = (y_5 - t) \cdot y_5(1 - y_5) \cdot y_4$$

With our numbers:

$$= (0.6903 - 0.5)(0.6903)(0.3097)(0.6637) = (0.1903)(0.2138)(0.6637) \approx 0.0270$$

Positive — so increasing $W_{45}$ makes things worse. We should decrease it:

$$W_{45}^{\text{new}} = 0.9 - (1)(0.0270) = 0.873$$

Which matches Huddar's video to three decimals. ✓
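
The three pieces translate almost symbol for symbol into code. This is a sketch with illustrative variable names (`dE_dy5`, `dy5_da5`, `da5_dW45`); it recomputes the forward pass at full precision, so the last printed digit can differ slightly from arithmetic done with rounded values.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x1, x2, t, eta = 0.35, 0.9, 0.5, 1.0
W13, W23, W14, W24, W35, W45 = 0.1, 0.8, 0.4, 0.6, 0.3, 0.9

# Forward pass (section 04).
y3 = sigmoid(W13 * x1 + W23 * x2)
y4 = sigmoid(W14 * x1 + W24 * x2)
y5 = sigmoid(W35 * y3 + W45 * y4)

# The three local derivatives along the chain W45 -> a5 -> y5 -> E.
dE_dy5 = y5 - t              # piece (1)
dy5_da5 = y5 * (1.0 - y5)    # piece (2): sigmoid derivative from the output
da5_dW45 = y4                # piece (3)

dE_dW45 = dE_dy5 * dy5_da5 * da5_dW45
W45_new = W45 - eta * dE_dW45   # gradient-descent update from section 07
print(f"dE/dW45 = {dE_dW45:.4f}, new W45 = {W45_new:.4f}")
```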

09 Why δ exists

Look back at the formula we just derived:

$$\frac{\partial E}{\partial W_{45}} = (y_5 - t) \cdot y_5(1 - y_5) \cdot y_4$$

The first two factors, $(y_5 - t)\cdot y_5(1-y_5)$, have nothing to do with which output weight we're computing. They'd be exactly the same if we were computing $\partial E / \partial W_{35}$ — only the last factor (which input we're multiplying through) changes. It would be wasteful to recompute them. So we give them a name:

$$\delta_5 \;:=\; y_5(1 - y_5)(t - y_5)$$

Notice we wrote $(t - y_5)$ instead of $(y_5 - t)$ — that flips the sign of $\delta$ relative to the "true" gradient. It's a convention chosen so the final update rule comes out as a plus sign. (Remember the heads-up at the end of §7.)

With $\delta_5$ defined, both gradients into the output layer have the same shape:

$$\frac{\partial E}{\partial W_{45}} = -\delta_5 \cdot y_4, \qquad \frac{\partial E}{\partial W_{35}} = -\delta_5 \cdot y_3$$

And the update rule, absorbing the minus sign into the convention:

$$\Delta W_{j5} = +\eta \cdot \delta_5 \cdot o_j$$

where $o_j$ is the output of the upstream neuron. Same formula for every output-layer weight. The neuron-specific work ($\delta_5$) gets computed once and reused for both weights.

What δ really means

$\delta_j$ is, up to a sign, "how much $E$ would change if neuron $j$'s pre-activation $a_j$ changed by a tiny amount." Equivalently: how much responsibility for the error sits at this neuron.

For our network:

$$\delta_5 = (0.6903)(0.3097)(-0.1903) \approx -0.0407$$

Negative because we overshot — the output is too high. Whichever weights feed into $O_5$ should all move in a direction that decreases $a_5$, and indeed the formula $\Delta W = \eta\, \delta_5\, o$ delivers negative updates when $\delta_5 < 0$ and $o > 0$.
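
Here is the same bookkeeping as a sketch in code, starting from the rounded forward-pass values of §04. It computes $\delta_5$ once, reuses it for both output weights, and confirms that the plus-sign update is the ordinary gradient-descent step in disguise. The variable names are illustrative.

```python
# Forward-pass values, rounded as in section 04.
y3, y4, y5, t = 0.6803, 0.6637, 0.6903, 0.5
W35, W45, eta = 0.3, 0.9, 1.0

# Computed once for the output neuron...
delta5 = y5 * (1.0 - y5) * (t - y5)

# ...then reused for every weight feeding O5.
new_W35 = W35 + eta * delta5 * y3
new_W45 = W45 + eta * delta5 * y4

# The same update in the "true gradient" convention: W - eta * dE/dW.
dE_dW45 = (y5 - t) * y5 * (1.0 - y5) * y4
assert abs(new_W45 - (W45 - eta * dE_dW45)) < 1e-12

print(f"delta5={delta5:.4f}  W35 -> {new_W35:.4f}  W45 -> {new_W45:.4f}")
```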

10 The gradient of a hidden weight

Now $W_{13}$ — the weight from $x_1$ into $H_3$. The chain is longer:

[Figure 7: the chain W₁₃ → a₃ → y₃ → a₅ → y₅ → E, with local factors x₁, y₃(1−y₃), W₃₅, y₅(1−y₅), −(t−y₅). ∂E/∂W₁₃ is the product of all five; the last two are exactly what built δ₅.]
Fig. 7. Five factors instead of three. The last two (in amber) are reused from the output computation. That's the "backprop" trick: don't recompute what you already have.
$$\frac{\partial E}{\partial W_{13}} = \frac{\partial E}{\partial y_5}\frac{\partial y_5}{\partial a_5}\frac{\partial a_5}{\partial y_3}\frac{\partial y_3}{\partial a_3}\frac{\partial a_3}{\partial W_{13}}$$

The first two factors are exactly what we computed for the output. Their product is $-\delta_5$. The remaining three:

  • $\partial a_5 / \partial y_3 = W_{35}$ (from $a_5 = W_{35}y_3 + W_{45}y_4$).
  • $\partial y_3 / \partial a_3 = y_3(1 - y_3)$, the sigmoid derivative again.
  • $\partial a_3 / \partial W_{13} = x_1$ (the upstream output, which here is just the input).
$$\frac{\partial E}{\partial W_{13}} = -\delta_5 \cdot W_{35} \cdot y_3(1-y_3) \cdot x_1$$

The first three factors — $-\delta_5 \cdot W_{35} \cdot y_3(1-y_3)$ — depend on the neuron $H_3$, not on the specific weight $W_{13}$. They'd be identical if we were computing $\partial E / \partial W_{23}$ instead. So we pull them out and give them a name:

$$\delta_3 \;:=\; y_3(1 - y_3) \cdot W_{35} \cdot \delta_5$$

And then:

$$\frac{\partial E}{\partial W_{13}} = -\delta_3 \cdot x_1 \quad\Longrightarrow\quad \Delta W_{13} = +\eta \cdot \delta_3 \cdot x_1$$

The shape of the update rule is the same as before: $\Delta W = \eta \cdot \delta_{\text{destination}} \cdot o_{\text{source}}$. The neuron's $\delta$ does all the bookkeeping; the weight-specific work is just multiplying by the upstream output. This pattern works at every layer in any depth of network.

What the hidden δ formula says, in English

Read the formula $\delta_3 = y_3(1-y_3)\cdot W_{35}\cdot \delta_5$ aloud:

Neuron 3's responsibility for the error equals its own sigmoid derivative ($y_3(1-y_3)$), times the strength of its connection to the unit downstream ($W_{35}$), times that downstream unit's responsibility ($\delta_5$).

The chain rule has handed us a recipe for propagating responsibility backward through the network. Each hidden neuron's $\delta$ is constructed from the $\delta$ of the neurons it feeds. That's the "back-prop" in backpropagation — literally what the algorithm is named after.

Computing both hidden deltas with our numbers:

$$\delta_3 = (0.6803)(0.3197)(0.3)(-0.0407) \approx -0.00266$$
$$\delta_4 = (0.6637)(0.3363)(0.9)(-0.0407) \approx -0.00817$$
A useful intuition

Notice $|\delta_4| > |\delta_3|$, even though both hidden neurons feed the same output. The reason: $W_{45} = 0.9$ is much larger than $W_{35} = 0.3$, so $H_4$ has more influence on the output, and gets a larger share of the blame when the output is wrong. The chain rule is doing this automatically — strong connections route more responsibility through them.
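
A hidden-layer gradient is the first place where it is easy to drop a factor, so an independent check is worthwhile. The sketch below computes $\delta_3$ and $\delta_4$ from the formula and then compares $-\delta_3 x_1$ with a finite-difference estimate of $\partial E / \partial W_{13}$. The list layout of the weights and the helper names are assumptions for this illustration. It works at full precision, so the deltas can differ from the hand-rounded ones in the last digit.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

x1, x2, t = 0.35, 0.9, 0.5

def forward(W13, W23, W14, W24, W35, W45):
    y3 = sigmoid(W13 * x1 + W23 * x2)
    y4 = sigmoid(W14 * x1 + W24 * x2)
    y5 = sigmoid(W35 * y3 + W45 * y4)
    return y3, y4, y5

def loss(weights):
    return 0.5 * (t - forward(*weights)[2]) ** 2

w = [0.1, 0.8, 0.4, 0.6, 0.3, 0.9]   # order: W13, W23, W14, W24, W35, W45
y3, y4, y5 = forward(*w)

delta5 = y5 * (1 - y5) * (t - y5)
delta3 = y3 * (1 - y3) * w[4] * delta5   # responsibility routed back via W35
delta4 = y4 * (1 - y4) * w[5] * delta5   # responsibility routed back via W45

# Independent check of dE/dW13 = -delta3 * x1 by finite differences.
h = 1e-6
up, down = list(w), list(w)
up[0] += h
down[0] -= h
numeric = (loss(up) - loss(down)) / (2 * h)

print(f"delta3={delta3:.5f}  delta4={delta4:.5f}")
print(f"-delta3*x1={-delta3 * x1:.6f}  numeric={numeric:.6f}")
```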

11 What if a neuron feeds many outputs?

In our example, $H_3$ feeds only $O_5$. But suppose there were two output neurons, $O_5$ and $O_6$. Then nudging $y_3$ would affect both, and through each of them the error.

The chain rule has a rule for this: when a variable affects the loss through multiple paths, sum the contributions across the paths.

[Figure 8: H₃ feeds both O₅ (via W₃₅) and O₆ (via W₃₆), which both feed E. δ₅ flows back via W₃₅ and δ₆ via W₃₆, giving δ₃ = y₃(1−y₃) · (W₃₅·δ₅ + W₃₆·δ₆), a sum across all downstream units.]
Fig. 8. When $H_3$ feeds more than one downstream neuron, the deltas from all of them are summed (each weighted by the connecting weight) to produce $\delta_3$.
$$\frac{\partial E}{\partial y_3} = (-\delta_5)W_{35} + (-\delta_6)W_{36}$$

So the general formula for a hidden neuron's $\delta$ is:

$$\delta_j \;=\; o_j(1 - o_j) \sum_{k \in \text{downstream}} \delta_k \cdot W_{jk}$$

That sum is the only structural difference between a hidden neuron with one downstream unit and one with many. With a single downstream unit, the sum collapses to a single term. The Huddar example has exactly one output, so the sum looks trivial — but the same machinery generalizes to networks of arbitrary depth and width.
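
As a sketch, the general hidden-δ formula is a one-line function. The name `hidden_delta` and the list-of-pairs argument are illustrative, and the values used for $W_{36}$ and $\delta_6$ are made up, since the tutorial's network has no sixth unit.

```python
def hidden_delta(o_j, downstream):
    """delta_j = o_j * (1 - o_j) * sum over k of (W_jk * delta_k).

    `downstream` is a list of (W_jk, delta_k) pairs, one per unit fed by j.
    """
    return o_j * (1.0 - o_j) * sum(w * d for w, d in downstream)

y3 = 0.6803

# One downstream unit: the sum collapses to a single term (the tutorial's network).
single = hidden_delta(y3, [(0.3, -0.0407)])

# Two downstream units: W36 = 0.5 and delta6 = 0.02 are invented for illustration.
double = hidden_delta(y3, [(0.3, -0.0407), (0.5, 0.0200)])

print(f"one output: {single:.5f}   two outputs: {double:.5f}")
```

In the two-output case the hypothetical $O_6$ pulls in the opposite direction from $O_5$, so the contributions partly cancel and $H_3$ ends up with a smaller share of the blame.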

12 The complete algorithm

Everything we've derived, compressed:

Forward pass

For each neuron, in order from input to output:

$$a_j = \sum_i W_{ij}\, o_i, \qquad o_j = \sigma(a_j)$$

Backward pass — output neurons

$$\delta_j = o_j(1 - o_j)(t_j - o_j)$$

Backward pass — hidden neurons

$$\delta_j = o_j(1 - o_j) \sum_{k \text{ downstream}} \delta_k\, W_{jk}$$

Weight update

$$W_{ij} \;\leftarrow\; W_{ij} + \eta \cdot \delta_j \cdot o_i$$

Four formulas. The by-hand exercises in the companion file are nothing but careful bookkeeping of these four pieces. The same four formulas, with $\sigma$ (and its derivative $o(1-o)$) replaced by some other activation function and $\frac{1}{2}(t-y)^2$ (which supplies the $(t_j - o_j)$ factor in the output $\delta$) replaced by some other loss, give you the algorithm used to train essentially every neural network in use today.
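
The four formulas can be put together as a single function for a fully connected sigmoid network of any depth. This is a sketch under stated assumptions rather than the companion verifier's actual code: the name `train_step` is invented, weights are stored destination-first as `weights[layer][destination][source]`, and there are no biases.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(weights, x, t, eta):
    """One forward/backward/update pass; returns (new_weights, outputs).

    weights[l][j][i] is the weight from unit i of layer l to unit j of
    layer l + 1 (destination index first). The input list is not modified.
    """
    # Forward pass: keep every layer's outputs for the backward pass.
    outs = [list(x)]
    for layer in weights:
        outs.append([sigmoid(sum(w * o for w, o in zip(row, outs[-1])))
                     for row in layer])

    # Backward pass: output deltas first, then hidden layers back to front.
    deltas = [[o * (1 - o) * (tj - o) for o, tj in zip(outs[-1], t)]]
    for l in range(len(weights) - 1, 0, -1):
        nxt = deltas[0]
        deltas.insert(0, [
            o * (1 - o) * sum(weights[l][k][j] * nxt[k] for k in range(len(nxt)))
            for j, o in enumerate(outs[l])
        ])

    # Weight update: W_ij <- W_ij + eta * delta_j * o_i.
    new_weights = [
        [[w + eta * deltas[l][j] * outs[l][i] for i, w in enumerate(row)]
         for j, row in enumerate(layer)]
        for l, layer in enumerate(weights)
    ]
    return new_weights, outs[-1]

# The tutorial's network: rows are the destinations H3 and H4, then O5.
weights = [[[0.1, 0.8], [0.4, 0.6]], [[0.3, 0.9]]]
new_weights, y = train_step(weights, x=[0.35, 0.9], t=[0.5], eta=1.0)
print([[[round(w, 4) for w in row] for row in layer] for layer in new_weights])
```

All deltas are computed from the old weights before any weight is changed, which matches the order of operations in the by-hand procedure.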

13 Exercise 1, worked end-to-end

Let's redo the example from scratch, doing nothing new — but showing every intermediate number. Read this while writing it out yourself on paper; you'll learn ten times as much from doing the arithmetic as from reading mine.

Step 1 — Forward pass

$$a_3 = (0.1)(0.35) + (0.8)(0.9) = 0.755 \;\Rightarrow\; y_3 = \sigma(0.755) \approx 0.6803$$
$$a_4 = (0.4)(0.35) + (0.6)(0.9) = 0.68 \;\Rightarrow\; y_4 = \sigma(0.68) \approx 0.6637$$
$$a_5 = (0.3)(0.6803) + (0.9)(0.6637) = 0.8014 \;\Rightarrow\; y_5 = \sigma(0.8014) \approx 0.6903$$

Step 2 — δ at the output

$$\delta_5 = (0.6903)(0.3097)(0.5 - 0.6903) \approx -0.0407$$

Negative because we overshot the target.

Step 3 — δ at the hidden neurons

$$\delta_3 = (0.6803)(0.3197)(0.3)(-0.0407) \approx -0.00266$$
$$\delta_4 = (0.6637)(0.3363)(0.9)(-0.0407) \approx -0.00817$$

Step 4 — All six weight updates

Apply $\Delta W_{ij} = \eta \cdot \delta_j \cdot o_i$, matching the destination's $\delta$ with the source's output:

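| Weight | δ (destination) | o (source) | ΔW | New value |
|---|---|---|---|---|
| W₁₃ | −0.00266 | 0.35 | −0.000931 | 0.0991 |
| W₂₃ | −0.00266 | 0.9 | −0.00239 | 0.7976 |
| W₁₄ | −0.00817 | 0.35 | −0.00286 | 0.3971 |
| W₂₄ | −0.00817 | 0.9 | −0.00735 | 0.5926 |
| W₃₅ | −0.0407 | 0.6803 | −0.0277 | 0.2723 |
| W₄₅ | −0.0407 | 0.6637 | −0.0270 | 0.8730 |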

All six weights decreased — consistent with $\delta_5 < 0$ pushing the network to lower its output. The numbers match the video exactly.

Step 5 — Confirm by re-running the forward pass

$$y_3' \approx 0.6797,\quad y_4' \approx 0.6620,\quad y_5' \approx 0.6820$$

New error: $0.5 - 0.682 = -0.182$, down in magnitude from $-0.190$. The error shrank after a single step. Iterate this thousands of times and the error approaches zero.

that's the algorithm — everything from here is elaboration

14 Watching it train

One iteration shrinks the error by about 4%. So 100 iterations? 1000? Here's what happens if we just keep applying the same procedure over and over to this single example:

[Interactive plot: loss E versus iteration, starting at iter 0 with E = 0.0181.]
Fig. 9. Loss versus iteration, running the same backprop step repeatedly with the Huddar starting weights. At η = 1 the loss falls fast and smooth. Crank η too high and watch the error oscillate or explode. Drop η too low and it crawls forever.

What this plot is actually showing: at each iteration we run a forward pass to get a prediction, compute the loss, run a backward pass to get deltas, update all six weights, and repeat. The same arithmetic from §13, just done several thousand times in a fraction of a second. The y-axis is the loss; the x-axis is the iteration count.

What changes with multiple training examples

In a real training run, you wouldn't have one input and one target; you'd have a dataset. The loss is then averaged across the dataset (or a mini-batch from it), and a single gradient-descent step uses the average gradient. The per-example mechanics are exactly the ones we just walked through; you just average the per-example gradients before applying the update.
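
The repeated single-example loop described in this section can be sketched as follows. The function name `step`, the tuple layout of the weights, and the choice of 1000 iterations are illustrative assumptions; the starting weights, input, target and η are the ones used throughout.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def step(w, x=(0.35, 0.9), t=0.5, eta=1.0):
    """One backprop update of the 2-2-1 network; returns (new_w, loss_before)."""
    W13, W23, W14, W24, W35, W45 = w
    y3 = sigmoid(W13 * x[0] + W23 * x[1])
    y4 = sigmoid(W14 * x[0] + W24 * x[1])
    y5 = sigmoid(W35 * y3 + W45 * y4)
    d5 = y5 * (1 - y5) * (t - y5)
    d3 = y3 * (1 - y3) * W35 * d5
    d4 = y4 * (1 - y4) * W45 * d5
    new_w = (W13 + eta * d3 * x[0], W23 + eta * d3 * x[1],
             W14 + eta * d4 * x[0], W24 + eta * d4 * x[1],
             W35 + eta * d5 * y3, W45 + eta * d5 * y4)
    return new_w, 0.5 * (t - y5) ** 2

w = (0.1, 0.8, 0.4, 0.6, 0.3, 0.9)
losses = []
for _ in range(1000):
    w, E = step(w)
    losses.append(E)   # loss measured before each update

print(f"iter 0: E={losses[0]:.4f}   iter 999: E={losses[-1]:.2e}")
```

Plotting `losses` against the iteration index gives the kind of curve shown in Fig. 9; changing `eta` in the call to `step` is the code equivalent of moving the slider.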

15 The vanishing-gradient problem

Look at the sigmoid derivative $\sigma'(a) = y(1-y)$ as a function. It's a small bump centered at $y = 0.5$, where it peaks at $0.25$. As $y$ moves toward 0 or 1, $\sigma'$ shrinks toward zero quickly:

Fig. 10. The sigmoid (teal) and its derivative (amber). When the input is large in magnitude, the neuron saturates — $\sigma$ is near 0 or 1 and $\sigma'$ is microscopic. Drag to see this happen.

Now remember the hidden-delta formula:

$$\delta_j = y_j(1 - y_j) \sum_k \delta_k\, W_{jk}$$

Every time the gradient passes back through a sigmoid, it gets multiplied by $y(1-y)$, which is at most 0.25 and often much smaller. In a network with many layers, the gradient flowing back to the early layers is the product of many such small factors, so it shrinks geometrically with depth unless large weights compensate. Ten sigmoid layers contribute a factor of at most $0.25^{10} \approx 10^{-6}$, and saturated neurons make it far smaller; the early-layer weights then receive vanishingly small updates and effectively stop learning.

This is the vanishing-gradient problem. It was one of the main reasons deep neural networks didn't work well before about 2010. Two developments eventually tamed it:

  • Better activation functions. ReLU ($\max(0, x)$) has derivative either 0 or 1 — no shrinking in the gradient when the neuron is active. Modern variants (Leaky ReLU, GELU, SiLU) refine this further.
  • Better initialization, normalization, and architectures. Skip connections (as in ResNets), layer normalization, careful weight initialization — all designed in part to keep gradients from vanishing or exploding through depth.

The chain-rule machinery we just derived is exactly the same in modern networks. The pieces that change are which activation function lives at each neuron and what other tricks the architecture employs to keep gradients well-behaved. Backprop itself didn't need to be reinvented.
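
The geometric shrinkage is easy to see in numbers. The sketch below multiplies sigmoid derivatives along a single backward path, with every weight taken as 1 so that only the activation's effect shows. The function name `backward_scale` and the choice of $a = 3$ as a "fairly saturated" pre-activation are illustrative assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    o = sigmoid(a)
    return o * (1.0 - o)

def backward_scale(pre_activations):
    """Product of sigmoid derivatives along one path (weights taken as 1)."""
    scale = 1.0
    for a in pre_activations:
        scale *= sigmoid_prime(a)
    return scale

for depth in (1, 5, 10, 20):
    best = backward_scale([0.0] * depth)        # every neuron at the 0.25 peak
    saturated = backward_scale([3.0] * depth)   # every neuron fairly saturated
    print(f"depth {depth:2d}: best case {best:.2e}, a=3 case {saturated:.2e}")
```

Even the best case loses a factor of four per layer. A ReLU unit that is active contributes a factor of exactly 1 instead, which is the point of the first bullet above.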

16 Bias terms (the thing we left out)

We've been writing $a_j = \sum_i W_{ij}\, o_i$. A real neuron in a modern network has one more parameter — a bias term $b_j$ that's added regardless of input:

$$a_j = b_j + \sum_i W_{ij}\, o_i$$

The bias gives the neuron the ability to "fire" even when all inputs are zero, and to shift the sigmoid's operating point left or right. Without it, the neuron's output is forced to be $\sigma(0) = 0.5$ when all inputs vanish — not always what you want.

The good news: backprop doesn't change at all. A bias is just a weight whose "upstream output" is the constant 1. Apply the update rule with $o = 1$:

$$\Delta b_j = \eta \cdot \delta_j \cdot 1 = \eta \cdot \delta_j$$

That's it. Every neuron in a real network has one bias, and biases are trained with the same δ-based updates as the rest of the weights.
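
A single neuron is enough to see the bias update at work. In this sketch the function name `neuron_step` is illustrative and the target 0.9 is a made-up value. All inputs are set to zero, so the weights receive no update and only the bias can move the output away from 0.5.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def neuron_step(weights, bias, inputs, t, eta=1.0):
    """One update of a single sigmoid output neuron with a bias."""
    a = bias + sum(w * o for w, o in zip(weights, inputs))
    y = sigmoid(a)
    delta = y * (1.0 - y) * (t - y)
    new_weights = [w + eta * delta * o for w, o in zip(weights, inputs)]
    new_bias = bias + eta * delta * 1.0   # the bias's "upstream output" is 1
    return new_weights, new_bias, y

w, b = [0.3, 0.9], 0.0
for _ in range(3):
    w, b, y = neuron_step(w, b, inputs=[0.0, 0.0], t=0.9)
    print(f"y={y:.4f}  bias={b:.4f}  weights={w}")
```

On the first pass the output is $\sigma(0) = 0.5$, so $\delta = 0.25 \times 0.4 = 0.1$ and the bias moves to 0.1 while both weights stay put.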

17 A vectorized glimpse

So far we've been writing everything element-by-element with subscripts. Real ML frameworks write the same algorithm in matrix-vector form. For a layer with input $\mathbf{o}^{(l-1)}$ and weights $W^{(l)}$:

Forward pass

$$\mathbf{a}^{(l)} = W^{(l)}\, \mathbf{o}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{o}^{(l)} = \sigma(\mathbf{a}^{(l)})$$

($\sigma$ applied element-wise to the vector.)

Backward pass

$$\boldsymbol{\delta}^{(L)} = \mathbf{o}^{(L)} \odot (1 - \mathbf{o}^{(L)}) \odot (\mathbf{t} - \mathbf{o}^{(L)}) \quad \text{(output layer)}$$
$$\boldsymbol{\delta}^{(l)} = \mathbf{o}^{(l)} \odot (1 - \mathbf{o}^{(l)}) \odot \left( (W^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)} \right) \quad \text{(hidden layers)}$$

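($\odot$ is element-wise multiplication. $(W^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}$ is the matrix-vector form of the summation $\sum_k \delta_k W_{jk}$: same arithmetic, different notation. For these products to line up, row $j$ of the matrix $W^{(l)}$ holds the weights into destination unit $j$, so its entry in row $j$, column $i$ is the element-wise $W_{ij}$.)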

Weight update

$$W^{(l)} \leftarrow W^{(l)} + \eta \cdot \boldsymbol{\delta}^{(l)} \, (\mathbf{o}^{(l-1)})^\top, \qquad \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} + \eta \cdot \boldsymbol{\delta}^{(l)}$$

That's the same algorithm we derived in §12, written so that a GPU can do it in parallel. Frameworks such as PyTorch and JAX compute these gradients for you through automatic differentiation, but the computation they carry out is the one in these formulas. The matrices get bigger and the activations get fancier, but the structure is unchanged from what you just learned.
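
To see the matrix form reproduce the by-hand numbers, here is a sketch that uses plain Python lists in place of arrays so it runs without extra dependencies. The helper names (`matvec`, `transpose`, `hadamard`, `one_minus`) are invented for this illustration; the comments note the NumPy operations they stand in for. Rows of each matrix are destinations and columns are sources.

```python
import math

def sigmoid_vec(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]

def matvec(W, v):                 # NumPy: W @ v
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):                 # NumPy: W.T
    return [list(col) for col in zip(*W)]

def hadamard(*vs):                # element-wise product of several vectors
    return [math.prod(items) for items in zip(*vs)]

def one_minus(v):
    return [1.0 - x for x in v]

W1 = [[0.1, 0.8],    # into H3: W13, W23
      [0.4, 0.6]]    # into H4: W14, W24
W2 = [[0.3, 0.9]]    # into O5: W35, W45
x, t, eta = [0.35, 0.9], [0.5], 1.0

# Forward pass, layer by layer.
o1 = sigmoid_vec(matvec(W1, x))
o2 = sigmoid_vec(matvec(W2, o1))

# Backward pass: output deltas, then hidden deltas via the transposed weights.
d2 = hadamard(o2, one_minus(o2), [tj - oj for tj, oj in zip(t, o2)])
d1 = hadamard(o1, one_minus(o1), matvec(transpose(W2), d2))

# Weight update: W <- W + eta * outer(delta, o_prev).  NumPy: np.outer(d, o)
W2_new = [[w + eta * dj * oi for w, oi in zip(row, o1)] for row, dj in zip(W2, d2)]
W1_new = [[w + eta * dj * oi for w, oi in zip(row, x)] for row, dj in zip(W1, d1)]
print(W1_new, W2_new)
```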

18 Common confusions, addressed

Which index of $W$ is source and which is destination?
Conventions differ. Huddar (and this tutorial) uses source–destination: $W_{13}$ is "from 1 to 3." Some textbooks reverse this. Mathematically identical. Always look at a single concrete example to figure out which way someone meant. In code, the array layout matters too: the companion Python verifier indexes weights as weights[layer][destination][source], so weights[0][1][0] is $W_{14}$.
Why does the update use the upstream output and not the downstream activation?
Because $\partial a_j / \partial W_{ij} = o_i$. The weight $W_{ij}$ multiplies $o_i$ in the formula for $a_j$, so the derivative of $a_j$ with respect to that weight is exactly $o_i$. Whatever the weight multiplied in the forward pass is what the gradient "sees" at that weight.
Why does $y(1-y)$ keep showing up?
It's the sigmoid derivative at the current operating point. Every neuron has a sigmoid, and every gradient flowing back through that neuron is scaled by its derivative. Since $\sigma'(a) = y(1-y)$ and we already have $y$ from the forward pass, no extra computation is needed.
Why do different sources show different signs?
Three equivalent conventions cause this:
  1. $\delta = y(1-y)(t - y)$, update $W \leftarrow W + \eta\, \delta\, o$ (the one used here).
  2. $\delta = y(1-y)(y - t)$, update $W \leftarrow W - \eta\, \delta\, o$.
  3. Loss written $\tfrac{1}{2}(y - t)^2$ instead of $\tfrac{1}{2}(t - y)^2$, with either of the above.
If your signs come out flipped, you've probably just swapped one convention. Pick one and write it on a sticky note.
How does this scale to many training examples?
The procedure above runs on a single (input, target) pair. With a dataset of $N$ pairs, the loss becomes the average: $E_{\text{total}} = \tfrac{1}{N}\sum_n E_n$. Differentiation is linear, so the gradient of the average is the average of the per-example gradients. You can compute the gradient for every example and average before each update (full-batch gradient descent), update after each single example (stochastic gradient descent), or update after each small batch (mini-batch SGD, which is what's used in practice). The mechanics per example are exactly what we just did.
Is there a difference between "error," "cost," "loss," and "objective"?
In everyday use, no. Some authors reserve "loss" for the per-example quantity and "cost" for the averaged version across the dataset; "objective" usually emphasizes that we're minimizing it. The math is the same; only the scope changes.

19 Where to go from here

If you've made it this far, you understand backpropagation. The fastest way to make that knowledge stick is to do the by-hand exercises in the companion file:

  • Exercise 1 is the example you just saw — do it on paper without looking.
  • Exercise 2 is the same network shape with different starting numbers and a smaller η.
  • Exercise 3 has two output neurons, which means the hidden-δ formula actually has a sum of two terms in it (not the single-term collapse of Exercises 1 and 2). This is the one that locks the algorithm in your head.

After that, four destinations:

Andrej Karpathy's micrograd video.
About 100 lines of Python that implement automatic differentiation from scratch. After doing backprop by hand, watching Karpathy build the engine feels like the formal proof of what you already know intuitively. Search "Karpathy micrograd" on YouTube.
Cross-entropy loss and softmax outputs.
When you move from regression to classification, the loss changes from squared error to cross-entropy and the output activation usually changes from sigmoid to softmax. The framework you just learned still applies; only $\delta$ at the output layer changes. Famously, with softmax + cross-entropy, the gradient $\partial E / \partial a_j$ simplifies all the way down to $o_j - t_j$ (prediction minus target), so in this tutorial's sign convention $\delta_j = t_j - o_j$. Cleaner than what we did here.
Modern optimizers.
Plain gradient descent with a fixed learning rate is what we just derived. Real ML uses variants — momentum, RMSprop, Adam — that add memory of past gradients to make convergence faster and more robust. The backprop step is unchanged; what changes is how the gradient is used in the update.
Beyond fully-connected layers.
CNNs (image processing) and transformers (LLMs) have different forward-pass mechanics than what we did here, but they're all still trained by chain-rule-based backpropagation. Once you understand it for a fully-connected network, the others are just different patterns of multiplications.
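
The softmax + cross-entropy simplification mentioned in the list above is one of the few claims here that we did not derive, so a numerical check may be reassuring. In this sketch the pre-activations and the one-hot label are made-up values, and the function names are illustrative. It compares "prediction minus target" with a finite-difference gradient of the loss with respect to each pre-activation.

```python
import math

def softmax(a):
    m = max(a)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in a]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(a, target):
    """E = -sum_j target_j * log(softmax(a)_j) for a one-hot target."""
    p = softmax(a)
    return -sum(tj * math.log(pj) for tj, pj in zip(target, p))

a = [2.0, -1.0, 0.5]        # made-up pre-activations
target = [0.0, 0.0, 1.0]    # made-up one-hot label

# Claimed closed form for dE/da_j: prediction minus target.
analytic = [pj - tj for pj, tj in zip(softmax(a), target)]

# Finite-difference estimate of the same gradient.
h = 1e-6
numeric = []
for j in range(len(a)):
    up, down = list(a), list(a)
    up[j] += h
    down[j] -= h
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))

print([round(g, 5) for g in analytic])
print([round(g, 5) for g in numeric])
```

The two printed lists should match. Negating them gives the output-layer $\delta$ in this tutorial's sign convention.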

Backpropagation looks intimidating from outside and ridiculously simple once you've done it by hand a few times. The hard part isn't the math — the math is just the chain rule. The hard part is keeping track of all the indices and not letting the notation scare you. You've done that now.