Foundations for backpropagation · part 2 of 3

The chain rule, taught slowly

If you can compute the slope of one function, the chain rule tells you the slope of any chain of functions. That is, mechanically, the whole content of backpropagation.


01 Why the chain rule?

The previous tutorial gave us a small toolkit: power rule, sum rule, sigmoid derivative. With those alone, we can differentiate simple functions. But the functions that show up in neural networks are not simple — they are compositions, layer upon layer of one function applied to the output of another.

A neural network's loss is a function of the network's output, which is a function of the last layer's pre-activation, which is a function of the previous layer's output, which is itself a function of weights and inputs… A typical network has dozens of these stages. To compute the slope of "loss" with respect to a weight buried deep inside, we need a rule for differentiating compositions. That rule is the chain rule.

The whole idea, in one sentence

If $y$ depends on $u$, and $u$ depends on $x$, then a small change in $x$ causes a small change in $u$, which causes a small change in $y$. The rate at which $y$ changes with $x$ is the rate at which $y$ changes with $u$, multiplied by the rate at which $u$ changes with $x$. That's the chain rule.

02 Composition of functions

Two functions compose when you feed the output of one into the input of the other. If $f$ takes a number and squares it, and $g$ takes a number and adds 1, then their composition $f(g(x))$ takes $x$, adds 1, and squares the result:

$$g(x) = x + 1, \qquad f(u) = u^2, \qquad f(g(x)) = (x + 1)^2$$

The notation $f \circ g$ — read "$f$ composed with $g$" or "$f$ of $g$" — is sometimes used, but $f(g(x))$ is clearer when you're learning.
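In code, a composition is nothing more than a nested function call. A minimal Python sketch of the example above (the helper name `compose` is ours, chosen for illustration, and not taken from any library):

```python
def g(x):
    """Inner function: add 1."""
    return x + 1


def f(u):
    """Outer function: square."""
    return u ** 2


def compose(outer, inner):
    """Return the function x -> outer(inner(x))."""
    return lambda x: outer(inner(x))


f_of_g = compose(f, g)   # x -> (x + 1)^2
g_of_f = compose(g, f)   # x -> x^2 + 1, a different function

print(f_of_g(2))  # (2 + 1)^2 = 9
print(g_of_f(2))  # 2^2 + 1 = 5
```

Swapping the order changes the result, which is why it matters to keep track of which function is inner and which is outer.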

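[Figure: a pipeline diagram. The input $x$ enters the inner function $g$ ("first apply g"); its output $g(x)$ enters the outer function $f$ ("then apply f"), giving $y = f(g(x))$. A composition is just a pipeline: the output of one is the input of the next.]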
Fig. 1. Composition. Read right-to-left: $f(g(x))$ means "first $g$, then $f$." If you bring this picture to mind every time you see a chain rule formula, the indices stop being confusing.

Compositions are everywhere in calculus because real functions are usually built from simple ones. $\sin(x^2)$ is a composition: first square $x$, then take the sine. $e^{-x^2/2}$ is a composition: first square $x$, then negate and halve, then exponentiate. The sigmoid $\sigma(x) = 1/(1 + e^{-x})$ is a composition four layers deep.

03 Rates of change multiply

Before we state the chain rule, let's earn the intuition for why rates of change multiply across a chain. A small story:

Story

You're walking through a forest. For every meter you walk forward, you climb $0.1$ meters in altitude. For every meter you climb in altitude, the temperature drops by $0.6°$. So: for every meter you walk forward, the temperature drops by how much?

$0.1 \times 0.6 = 0.06°$ per meter walked.

The rates compose by multiplication. That's all the chain rule is saying — applied to whatever variables we like.

Notice the structure: we have a chain "distance → altitude → temperature," and the rate "distance → temperature" is the product of the rate "distance → altitude" and the rate "altitude → temperature." The same multiplicative pattern holds for functions:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

where $y$ depends on $u$ which depends on $x$. The Leibniz notation makes this look almost obvious — the $du$'s "cancel" symbolically, which is a useful mnemonic even though it's not how the derivation actually works.

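[Interactive figure: a slider sets $x$; the readouts show $u = 2x$ and $y = u^2$. At $x = 1$ they read $u = 2.00$ and $y = 4.00$.]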
Fig. 2. A two-stage chain. Slide $x$: the middle box ($u = 2x$) responds at rate $du/dx = 2$, then the right box ($y = u^2$) responds at rate $dy/du = 2u$. The end-to-end rate is the product $2 \cdot 2u = 4u = 8x$, which exactly matches differentiating $y = (2x)^2 = 4x^2$ directly: $dy/dx = 8x$.
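The same check can be run without the interactive figure. The sketch below estimates each rate numerically with a central difference (the helper `slope`, the step size, and the sample point $x = 1.5$ are our own choices for illustration) and compares the product of the two local rates with the end-to-end rate:

```python
def u_of_x(x):
    """Middle box: u = 2x."""
    return 2 * x


def y_of_u(u):
    """Right box: y = u^2."""
    return u ** 2


def slope(func, at, h=1e-6):
    """Central-difference estimate of the derivative of func at the point `at`."""
    return (func(at + h) - func(at - h)) / (2 * h)


x = 1.5
u = u_of_x(x)

du_dx = slope(u_of_x, x)                       # local rate of the first stage, about 2
dy_du = slope(y_of_u, u)                       # local rate of the second stage, about 2u
dy_dx = slope(lambda t: y_of_u(u_of_x(t)), x)  # end-to-end rate, measured directly

print("product of local rates:", du_dx * dy_du)
print("end-to-end rate:       ", dy_dx)
print("formula 8x:            ", 8 * x)
```

All three printed numbers should agree up to floating-point noise.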

04 The chain rule, stated

The official form. Let $y = f(u)$ and $u = g(x)$. Then $y = f(g(x))$, and:

$$\frac{dy}{dx} \;=\; f'(g(x)) \cdot g'(x)$$

Read it as: differentiate the outer function $f$ at the inner value $g(x)$, then multiply by the derivative of the inner function.

Or equivalently, in Leibniz notation:

$$\frac{dy}{dx} \;=\; \frac{dy}{du} \cdot \frac{du}{dx}$$

These are the same rule. The first is more careful about where each derivative is evaluated; the second is more compact and works mechanically.

A reading rule of thumb

When you see a composition like $f(g(x))$, ask: what's the "outer" function, what's the "inner" function? The derivative is the outer's derivative at the inner's value, times the inner's derivative. The phrase "at the inner's value" is what trips most people up — it doesn't mean evaluate the outer derivative at $x$, it means evaluate it at $g(x)$.
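One way to make "at the inner's value" concrete is to write the rule as a small function. This is a sketch under our own naming (`chain_rule` and its parameters are hypothetical, not part of any library); the example function $\sqrt{x^2 + 1}$ is picked only for illustration:

```python
import math


def chain_rule(outer_prime, inner, inner_prime, x):
    """Return f'(g(x)) * g'(x).

    The outer derivative receives the inner VALUE g(x), never x itself.
    """
    inner_value = inner(x)
    return outer_prime(inner_value) * inner_prime(x)


# y = sqrt(x^2 + 1): outer f(u) = sqrt(u), inner g(x) = x^2 + 1
dydx = chain_rule(
    outer_prime=lambda u: 0.5 / math.sqrt(u),  # f'(u) = 1 / (2 sqrt(u))
    inner=lambda x: x ** 2 + 1,                # g(x)
    inner_prime=lambda x: 2 * x,               # g'(x)
    x=2.0,
)

print(dydx)                 # chain-rule value at x = 2
print(2 / math.sqrt(5))     # closed form x / sqrt(x^2 + 1) at x = 2
```

The only place `x` reaches the outer derivative is through `inner(x)`, which is exactly the point of the rule of thumb.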

05 Where the formula comes from

The intuition from §3 (rates multiply) is fine, but here is the actual derivation. It's two lines of algebra. Start with the limit definition from the previous tutorial, applied to $f(g(x))$:

$$\frac{d}{dx} f(g(x)) = \lim_{h \to 0} \frac{f(g(x+h)) - f(g(x))}{h}$$

Now do a trick. Multiply and divide by the change in $g$:

$$= \lim_{h \to 0} \frac{f(g(x+h)) - f(g(x))}{g(x+h) - g(x)} \cdot \frac{g(x+h) - g(x)}{h}$$

The second factor, as $h \to 0$, is exactly $g'(x)$ by definition. The first factor — as $h \to 0$, $g(x+h)$ approaches $g(x)$, so we're looking at the slope of $f$ between two points that are getting closer and closer together; that's the definition of $f'$ at $g(x)$. So the limit is $f'(g(x))$.

$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$

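That's the core of the proof. A fully rigorous version needs more care (what if $g(x+h) = g(x)$ for a sequence of $h$ values shrinking to zero?), but this captures the substance: over a tiny step $h$, the rate of change of $f \circ g$ is the rate of change of $f$ with respect to $g$, times the rate of change of $g$ with respect to $x$.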
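The two factors in the derivation can be watched converging numerically. The sketch below uses $f(u) = \sin(u)$ and $g(x) = x^2$ at $x = 1$ (our choice of example, where $g(x+h) \neq g(x)$ so the division is safe) and shrinks $h$:

```python
import math


def f(u):
    return math.sin(u)


def g(x):
    return x ** 2


x = 1.0
exact = math.cos(g(x)) * 2 * x   # f'(g(x)) * g'(x)

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    dg = g(x + h) - g(x)                    # change in g; nonzero for this g and x
    first = (f(g(x + h)) - f(g(x))) / dg    # first factor, tends to f'(g(x)) = cos(1)
    second = dg / h                         # second factor, tends to g'(x) = 2
    print(f"h={h:g}  first={first:.6f}  second={second:.6f}  "
          f"product={first * second:.6f}  exact={exact:.6f}")
```

As $h$ shrinks, the first column approaches $\cos(1)$, the second approaches $2$, and their product approaches the chain-rule value.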

06 Worked example 1: (3x+1)²

Differentiate $y = (3x + 1)^2$ using the chain rule. Identify the outer and inner functions:

  • Inner: $g(x) = 3x + 1$. Then $g'(x) = 3$.
  • Outer: $f(u) = u^2$. Then $f'(u) = 2u$.

Apply the rule:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x) = 2(3x + 1) \cdot 3 = 6(3x + 1) = 18x + 6$$

Check by expanding first: $(3x + 1)^2 = 9x^2 + 6x + 1$, whose derivative is $18x + 6$. ✓

Reading the chain rule in this example: differentiate the outer function ($u^2$ becomes $2u$), plug in the inner ($u = 3x + 1$, giving $2(3x+1)$), multiply by the inner's derivative ($\cdot 3$).
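The three steps just described translate line by line into code. A small sketch (function names are ours) that compares the chain-rule answer with the derivative of the expanded polynomial at a few integer points:

```python
def chain_rule_answer(x):
    inner = 3 * x + 1            # plug in the inner: u = 3x + 1
    outer_slope = 2 * inner      # outer derivative 2u, evaluated at u = 3x + 1
    inner_slope = 3              # derivative of the inner function
    return outer_slope * inner_slope


def expanded_answer(x):
    return 18 * x + 6            # derivative of 9x^2 + 6x + 1


for x in (-2, 0, 1, 5):
    print(x, chain_rule_answer(x), expanded_answer(x))
```

Both columns agree at every point, as the algebra says they must.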

07 Worked example 2: sin(x²)

Differentiate $y = \sin(x^2)$.

  • Inner: $g(x) = x^2$. Then $g'(x) = 2x$.
  • Outer: $f(u) = \sin(u)$. Then $f'(u) = \cos(u)$.
$$\frac{dy}{dx} = \cos(x^2) \cdot 2x = 2x\cos(x^2)$$

Notice we did not get $\cos(x) \cdot 2x$. The outer derivative is evaluated at the inner value, which is $x^2$, not at $x$. This is the single most common mistake in chain-rule problems. The inner function reappears inside the outer derivative.
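A numerical slope makes it easy to tell the right answer from two tempting wrong ones: evaluating the outer derivative at $x$ instead of $x^2$, and dropping the inner derivative $2x$ altogether. A sketch, with the sample point $x = 1.3$ and the helper `numeric_slope` chosen by us for illustration:

```python
import math


def y(x):
    return math.sin(x ** 2)


def numeric_slope(func, x, h=1e-6):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 1.3
reference = numeric_slope(y, x)

candidates = {
    "cos(x^2) * 2x (correct)": math.cos(x ** 2) * 2 * x,
    "cos(x) * 2x (outer evaluated at x)": math.cos(x) * 2 * x,
    "cos(x^2) (inner derivative dropped)": math.cos(x ** 2),
}

for label, value in candidates.items():
    print(f"{label}: {value:.6f}, error vs numeric slope = {abs(value - reference):.2e}")
```

Only the first candidate matches the measured slope; the other two are off by a visible margin.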

The classic mistake

"Differentiate $\sin(x^2)$? Easy — $\cos(x^2)$." That answer forgot the factor of $g'(x) = 2x$. Always ask: is what I'm differentiating a composition? If yes, the chain rule's second factor is required.

08 Worked example 3: e^(−x²)

The exponential at the heart of the Gaussian: $y = e^{-x^2}$.

  • Inner: $g(x) = -x^2$. Then $g'(x) = -2x$.
  • Outer: $f(u) = e^u$. Then $f'(u) = e^u$.
$$\frac{dy}{dx} = e^{-x^2} \cdot (-2x) = -2x\,e^{-x^2}$$

The general formula $\frac{d}{dx} e^{g(x)} = e^{g(x)} \cdot g'(x)$ — exponential of anything, differentiated, is the same exponential times the derivative of the inner — is worth memorizing. It shows up constantly in probability, physics, and ML.
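As one more instance of that pattern, take the inner function $g(x) = -x^2/2$ (the exponent of the $e^{-x^2/2}$ example from §2), for which $g'(x) = -x$:

$$\frac{d}{dx}\, e^{-x^2/2} \;=\; e^{-x^2/2} \cdot (-x) \;=\; -x\,e^{-x^2/2}$$

The exponential comes back unchanged, and the only work is differentiating the exponent.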

09 Worked example 4: the sigmoid derivative

In the derivatives tutorial we computed $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ but glossed over the chain-rule step. Let's redo it carefully now.

Write $\sigma(x) = (1 + e^{-x})^{-1}$. Treating "exponentiate, then add 1" as a single stage, this is a composition of three things:

  • $h(x) = -x$ → $h'(x) = -1$.
  • $g(u) = 1 + e^u$ → $g'(u) = e^u$.
  • $f(v) = v^{-1}$ → $f'(v) = -v^{-2}$.

We want to differentiate $\sigma(x) = f(g(h(x)))$. Applying the chain rule across three stages (next section explains the general pattern):

$$\sigma'(x) = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

Substitute back:

$$= -\bigl(1 + e^{-x}\bigr)^{-2} \cdot e^{-x} \cdot (-1) = \frac{e^{-x}}{(1+e^{-x})^2}$$

Now algebraically rewrite the numerator as $(1+e^{-x}) - 1$ and split:

$$= \frac{1}{1+e^{-x}} - \frac{1}{(1+e^{-x})^2} = \sigma(x) - \sigma(x)^2 = \sigma(x)\bigl(1 - \sigma(x)\bigr) \quad\checkmark$$

Three stages of the chain rule, one algebraic flourish, and you have the result that powers every sigmoid neuron in every neural network.
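The derivation can be checked by multiplying the three local derivatives directly and comparing with the closed form $\sigma(x)(1 - \sigma(x))$. A minimal sketch (function names are ours):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime_chain(x):
    """Multiply the three local derivatives from the derivation above."""
    h = -x                      # h(x) = -x
    g = 1.0 + math.exp(h)       # g(u) = 1 + e^u, evaluated at u = h(x)
    dh = -1.0                   # h'(x)
    dg = math.exp(h)            # g'(u) = e^u, evaluated at u = h(x)
    df = -(g ** -2)             # f'(v) = -v^(-2), evaluated at v = g(h(x))
    return df * dg * dh


def sigmoid_prime_closed(x):
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-3.0, -0.5, 0.0, 0.7, 4.0):
    print(x, sigmoid_prime_chain(x), sigmoid_prime_closed(x))
```

The two columns should agree to floating-point precision at every sample point.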

10 Multi-stage chains

What if a function is a composition of more than two things? Just apply the chain rule repeatedly. For $y = f(g(h(x)))$:

$$\frac{dy}{dx} \;=\; f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$

For any number of stages, the pattern is identical: the derivative is the product of the local derivatives at each stage, each one evaluated at the appropriate intermediate value.

[Figure: the pipeline $x \to h(x) \to g(h(x)) \to f(g(h(x))) = y$, with the stages labeled by their local derivatives $h'(x)$, $g'(h(x))$, and $f'(g(h(x)))$. $dy/dx$ is the product of the three local derivatives: the chain rule, three stages.]
Fig. 3. Three-stage chain. To compute $dy/dx$, multiply the local derivatives, one per stage. Each derivative is evaluated at the value flowing in at that point: $h'$ is evaluated at $x$, $g'$ at $h(x)$, $f'$ at $g(h(x))$.

You can keep stacking. A 10-stage chain produces a product of 10 derivatives. That's exactly what backpropagation does in a 10-layer neural network — it computes the gradient of the loss with respect to a deep weight as a product of 10 local derivatives, one per layer.
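The "product of local derivatives" pattern fits in a short loop. The sketch below is an illustration under our own assumptions: the helper `chain_derivative` is hypothetical, and the ten identical $\tanh$ stages are chosen only because $\tanh$ has a simple derivative. It walks forward through the stages, multiplying in each local derivative at the value flowing into that stage, and checks the result against a numerical slope:

```python
import math


def chain_derivative(stages, x):
    """stages: list of (function, derivative) pairs, innermost first.

    Returns (output value, derivative of the output with respect to x).
    """
    value = x
    slope = 1.0
    for func, func_prime in stages:
        slope *= func_prime(value)   # local derivative at the value flowing in
        value = func(value)          # then pass the value on to the next stage
    return value, slope


def tanh_prime(v):
    return 1.0 - math.tanh(v) ** 2


stages = [(math.tanh, tanh_prime)] * 10   # a 10-stage chain


def nested(x):
    """The same 10-stage composition, with no derivative tracking."""
    for func, _ in stages:
        x = func(x)
    return x


x0 = 0.5
y, dy_dx = chain_derivative(stages, x0)

h = 1e-6
numeric = (nested(x0 + h) - nested(x0 - h)) / (2 * h)
print("product of 10 local derivatives:", dy_dx)
print("numerical slope:                ", numeric)
```

This loop accumulates the product front to back; backpropagation accumulates the same product starting from the output end, which pays off when many weights share the factors near the loss.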

11 The Leibniz notation

The same chain rule, written in Leibniz notation, looks almost like cancellation:

$$\frac{dy}{dx} \;=\; \frac{dy}{du} \cdot \frac{du}{dx}$$

Imagine the $du$'s "canceling" — it's not actually a fraction, but the mnemonic works. For three stages:

$$\frac{dy}{dx} = \frac{dy}{dv} \cdot \frac{dv}{du} \cdot \frac{du}{dx}$$

Each new intermediate variable inserts a new factor. This notation is helpful because it makes what's a function of what explicit. In backpropagation, this is exactly the notation we use to track gradients flowing back through a network: the loss is a function of the output, which is a function of the pre-activation, which is a function of the weight, and the chain has one factor for each link.

Bookkeeping helps

When a problem has multiple variables and compositions, writing each intermediate variable explicitly prevents errors. Don't try to do the chain rule "all in your head." Write down: "$u = $ ...", "$v = $ ...", then differentiate piece by piece.
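Here is that bookkeeping applied to one function, $y = \sin\bigl((3x+1)^2\bigr)$ (an example of ours, built from pieces used earlier in this tutorial). Name the intermediates first:

$$u = 3x + 1, \qquad v = u^2, \qquad y = \sin(v)$$

Differentiate each link on its own:

$$\frac{du}{dx} = 3, \qquad \frac{dv}{du} = 2u, \qquad \frac{dy}{dv} = \cos(v)$$

Then multiply and substitute back:

$$\frac{dy}{dx} = \frac{dy}{dv} \cdot \frac{dv}{du} \cdot \frac{du}{dx} = \cos(v) \cdot 2u \cdot 3 = 6(3x+1)\cos\bigl((3x+1)^2\bigr)$$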

12 Multivariate chain rule

Real neural networks involve functions of many variables, not just one. So we need a multivariate version of the chain rule. The principle is the same (rates multiply), but with one extra twist: when a variable affects the output through multiple paths, the contributions add.

Concretely, let $z = f(u, v)$ where $u = g(x)$ and $v = h(x)$. Then $z$ depends on $x$ through two paths: $x \to u \to z$ and $x \to v \to z$. The chain rule says:

$$\frac{dz}{dx} \;=\; \frac{\partial z}{\partial u} \cdot \frac{du}{dx} \;+\; \frac{\partial z}{\partial v} \cdot \frac{dv}{dx}$$

Multiply along each path, then sum across paths. The pattern "multiply along, add across" is what backpropagation uses constantly — whenever a hidden neuron feeds into multiple downstream neurons, its gradient is the sum of contributions through all paths.

Example

$z = u^2 + v^2$ with $u = \sin(x)$ and $v = \cos(x)$.

Partials: $\partial z/\partial u = 2u$, $\partial z/\partial v = 2v$. Derivatives: $du/dx = \cos(x)$, $dv/dx = -\sin(x)$.

$$\frac{dz}{dx} = 2u \cdot \cos(x) + 2v \cdot (-\sin(x)) = 2\sin(x)\cos(x) - 2\cos(x)\sin(x) = 0$$

Zero! Which makes sense: $z = \sin^2(x) + \cos^2(x) = 1$, a constant, whose derivative is zero. The chain rule confirms it without us having to simplify.
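"Multiply along each path, then sum across paths" is a one-line function. The sketch below (helper name and the second example are ours, added for illustration) reproduces the example above and then tries a case with a nonzero answer, $z = u \cdot v$ with $u = x^2$ and $v = \sin(x)$, where the result should match the product rule:

```python
import math


def total_derivative(dz_du, dz_dv, du_dx, dv_dx):
    """Multiply along each path, then add across the two paths."""
    return dz_du * du_dx + dz_dv * dv_dx


x = 0.8

# Example from the text: z = u^2 + v^2 with u = sin(x), v = cos(x)
u, v = math.sin(x), math.cos(x)
flat = total_derivative(2 * u, 2 * v, math.cos(x), -math.sin(x))

# Hypothetical second example: z = u * v with u = x^2, v = sin(x)
u2, v2 = x ** 2, math.sin(x)
product = total_derivative(v2, u2, 2 * x, math.cos(x))   # dz/du = v, dz/dv = u
expected = 2 * x * math.sin(x) + x ** 2 * math.cos(x)    # product rule, for comparison

print("sin^2 + cos^2 example:", flat)
print("x^2 * sin(x) example: ", product, "vs product rule", expected)
```

The second example also shows that the product rule is a special case of the two-path chain rule.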

This is the rule backprop uses

"Multiply along paths, sum across paths" is precisely the rule for hidden-layer deltas in backpropagation. When a neuron's output feeds into multiple downstream neurons, the gradient flowing back to that neuron is a sum: one term per downstream connection. The chain rule is doing this for us automatically.

13 Why this is backpropagation

A neural network's loss is a function:

$$\text{loss}(x, t; W) = E\bigl(\,\sigma(W_2 \cdot \sigma(W_1 \cdot x))\, ;\,t\,\bigr)$$

That's a composition many layers deep. To train the network, we need to know $\partial \text{loss}/\partial W$ for every weight $W$. The chain rule tells us exactly how to compute it: walk from "loss" back to each weight, multiplying local derivatives at every stage.

The reason the algorithm is called backpropagation — and not just "the chain rule applied to neural networks" — is that the multiplications are done in a specific clever order. Computing all the gradients naively, weight by weight, would do a lot of repeated work: the factors near the loss show up in every weight's gradient computation. Backprop computes those shared factors once, stores them at each neuron (the symbol $\delta$ in the backprop tutorial), and reuses them.

[Figure: one path through the network, weight $W$ → $a$ → $\sigma(a)$ → next layer → loss. Forward pass: values flow left to right. Backward pass: gradient terms are multiplied as they flow back, right to left.]
Fig. 4. Backpropagation is the chain rule, run in reverse along the network. Each backward arrow contributes one factor of the chain-rule product. Numbers stored at each neuron during the backward pass are exactly the partial products of all chain-rule factors downstream of that point.
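To see the reuse of shared factors on the smallest possible case, here is a sketch of the loss formula above with scalar weights. Several things are assumptions of ours, not taken from the tutorial: the error function is taken to be $E = \tfrac{1}{2}(\text{out} - t)^2$, the weights and inputs are single numbers, and the variable names (`delta1`, `delta2`, and so on) are chosen for illustration:

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def loss(w1, w2, x, t):
    """Assumed squared-error loss on a two-stage scalar network."""
    return 0.5 * (sigmoid(w2 * sigmoid(w1 * x)) - t) ** 2


def backprop(w1, w2, x, t):
    # Forward pass: keep every intermediate value.
    a1 = w1 * x
    h1 = sigmoid(a1)
    a2 = w2 * h1
    out = sigmoid(a2)

    # Backward pass: each delta is d(loss)/d(pre-activation).
    delta2 = (out - t) * out * (1.0 - out)     # dE/d(out) * d(out)/d(a2)
    delta1 = delta2 * w2 * h1 * (1.0 - h1)     # reuse delta2, times d(a2)/d(h1) * d(h1)/d(a1)

    grad_w2 = delta2 * h1                      # d(a2)/d(w2) = h1
    grad_w1 = delta1 * x                       # d(a1)/d(w1) = x
    return grad_w1, grad_w2


w1, w2, x, t = 0.5, -1.2, 2.0, 1.0
grad_w1, grad_w2 = backprop(w1, w2, x, t)

# Independent check with central differences.
h = 1e-6
numeric_w1 = (loss(w1 + h, w2, x, t) - loss(w1 - h, w2, x, t)) / (2 * h)
numeric_w2 = (loss(w1, w2 + h, x, t) - loss(w1, w2 - h, x, t)) / (2 * h)

print("d loss / d w1:", grad_w1, "numeric:", numeric_w1)
print("d loss / d w2:", grad_w2, "numeric:", numeric_w2)
```

`delta2` is computed once and then used twice, once for `grad_w2` and once inside `delta1`. That reuse is the whole trick, scaled down to two weights.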

14 Common confusions

Do I evaluate the outer derivative at $x$ or at $g(x)$?
At $g(x)$. The single most common error. Think of it this way: the outer function $f$ doesn't see $x$, it only sees whatever the inner function hands to it, which is $g(x)$. So $f$'s derivative is evaluated at the value $f$ is actually receiving.
What if there's a constant inside the inner function?
No special case — it just contributes through $g'(x)$. Example: $y = e^{3x}$. Inner $g(x) = 3x$, $g'(x) = 3$. Outer $f(u) = e^u$, $f'(u) = e^u$. So $dy/dx = e^{3x} \cdot 3 = 3e^{3x}$.
What if both inner and outer have multiple variables?
Use the multivariate version from §12: multiply along each path from input to output, add across paths. Applied to a network of any depth and connectivity, this is exactly what backpropagation computes.
How do I know when I need the chain rule?
Whenever you see a function inside another function. Concrete tells: parentheses around something more complicated than a single variable ($\sin(x^2)$, $e^{-x}$, $(1+x)^{10}$, $\sqrt{x^2 + 1}$), or function notation with a non-trivial argument ($f(g(x))$). If you can write it as "first do X, then do Y," you have a composition, and you need the chain rule.
Why doesn't the proof in §5 worry about $g(x+h) = g(x)$?
It should; that's a hole. If $g(x+h) = g(x)$ for some sequence of $h$'s shrinking to zero, the algebraic step "multiply and divide by the change in $g$" divides by zero. A rigorous proof handles that case separately: it can only happen where $g'(x) = 0$, and there the chain rule still gives the right answer, namely zero. So the formula is safe to use wherever the functions you meet in machine learning are differentiable. Take the §5 derivation as moral content, not a formal proof.
Why do paths add in the multivariate version?
Because contributions to a final change are independent until you sum them at the output. If $x$ affects $z$ via $u$ and via $v$, both effects happen — the total change in $z$ is the sum of "change due to $u$" and "change due to $v$." Same reason rates from independent sources add (the bank-account example in §7 of the derivatives tutorial).

15 Where to go next

You now have the entire mathematical core of backpropagation. Calculus has given you:

  • The idea that a derivative is the slope of a tangent (Part 1, §3).
  • How to differentiate elementary functions: $x^n$, $e^x$, $\sigma(x)$ (Part 1, §6–10).
  • Partial derivatives and the gradient vector (Part 1, §13–14).
  • The chain rule for compositions (this tutorial, §4).
  • The multivariate chain rule for many paths (this tutorial, §12).

Those five ideas — and absolutely no more — are what backpropagation runs on. The third tutorial in this series shows you the algorithm itself, applied to a tiny worked example: a neural network with 6 weights, traced from forward pass through gradient computation through weight update, with every number shown.
