Foundations for backpropagation · part 1 of 3

Derivatives, taught slowly

A derivative is the slope of a tangent line — and that is enough to understand every piece of calculus that appears in machine learning. We build up from straight lines.

Part of the backprop series ~35 min read 4 interactive figures

01 Why derivatives?

Most learning algorithms in machine learning, and every one trained by gradient descent, work the same way: there is a number that says "how wrong the model is," and the algorithm nudges the model's parameters to make that number smaller. The question, for every parameter, in every model, on every step, is: if I wiggle this parameter a tiny bit, does the wrongness go up or down, and by how much?

That question has a name. The answer is the derivative. Once you know the derivative of "wrongness" with respect to a parameter, you know which direction to nudge it. Calculus calls this the slope of the tangent line; machine learning calls it the gradient; physicists call it the rate of change. They are all the same idea, and that idea is the building block of backpropagation.

What you'll come out with

By the end of this page, you'll be able to (a) compute the derivative of any sum-of-power-functions, (b) compute the derivative of the exponential and the sigmoid by hand, (c) read partial derivatives without flinching, and (d) understand what a gradient vector is. That is all of calculus that backpropagation requires.

02 Slope of a straight line

Start with the easiest thing in mathematics: a straight line. If a line passes through two points $(x_1, y_1)$ and $(x_2, y_2)$, its slope is defined as rise over run:

$$\text{slope} = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$

The slope tells you: if I increase $x$ by 1, $y$ changes by this much. A slope of 2 means $y$ rises by 2 for every 1 we add to $x$. A slope of $-1$ means $y$ falls by 1.

The crucial property of a line is that the slope is the same everywhere on the line. It doesn't matter which two points you pick — same slope. That's what makes it a line.
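To make "same slope everywhere" concrete, here is a minimal sketch. The line $y = 2x + 1$ and the helper name `slope` are illustrative choices, not something defined earlier in this tutorial.

```python
def slope(p1, p2):
    """Rise over run between two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)


def line(x):
    # Hypothetical example line y = 2x + 1, chosen only for illustration.
    return 2 * x + 1


# Two different pairs of points on the same line give the same slope.
print(slope((0, line(0)), (1, line(1))))      # 2.0
print(slope((-3, line(-3)), (10, line(10))))  # 2.0
```

Whichever two distinct points you pick on this line, the ratio comes out to 2.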

Fig. 1. The slope of a line is "rise over run": slope = Δy / Δx, where Δx is the run and Δy is the rise between (x₁, y₁) and (x₂, y₂). It is constant, the same everywhere on the line.

03 Tangent to a curve

Now bend the line into a curve. A parabola, say — $y = x^2$. There is no longer a single slope, because the curve is steeper in some places than others. Near $x = 0$ it's almost flat; far from zero it's steep.

To recover the idea of a slope, we ask a more careful question: what is the slope at a specific point? Imagine zooming in on the curve right at one point. As you zoom in further and further, the curve looks more and more like a straight line. That straight line, the one that just kisses the curve at that one point, is the tangent, and its slope is what we mean by the slope of the curve at that point.

Drag the red point along the curve below and watch the tangent line follow it. The tangent's slope is the derivative at that point:

Fig. 2. The derivative is the slope of the tangent line. Drag to see the slope change from negative (curve descending) through zero (bottom of the bowl) to positive (curve ascending). The number under the curve is the slope at the current point.

So a derivative is a function: given an input $x$, it returns the slope of the curve at that point. For $y = x^2$, we'll discover in §5 that the derivative is $2x$. At $x = 1$, the slope is 2. At $x = -3$, the slope is $-6$. At $x = 0$, the slope is 0 — the curve is flat there, and that's the bottom of the bowl.

Notation

The derivative of $f(x)$ is written several ways, all meaning the same thing:

$$f'(x) \;=\; \frac{df}{dx} \;=\; \frac{dy}{dx} \;=\; Df(x)$$

$f'(x)$ is the most compact. $df/dx$ reads as "the change in $f$ per change in $x$" and is helpful when you want to track which variable you're differentiating with respect to.

04 The limit definition

"The slope of the line you get by zooming in" is a good intuition, but we need a formula. Here it is. Take two nearby points on the curve: $(x, f(x))$ and $(x+h, f(x+h))$, where $h$ is a tiny number. The slope of the line connecting them — the secant line — is:

$$\frac{f(x+h) - f(x)}{h}$$

Now shrink $h$ toward zero. The two points come closer together; the secant line tilts toward the tangent. Once $h$ is "infinitely small," the secant has become the tangent, and we have the slope at $x$:

$$f'(x) \;=\; \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

That symbol $\lim_{h \to 0}$, read "the limit as $h$ goes to zero," is the formal way of saying "shrink $h$ down to nothing." Mathematicians spent much of the 19th century making "infinitely small" rigorous; we can get the whole benefit by trusting their work and just shrinking $h$ in our heads.

Fig. 3. The secant line (red) connects two points $h$ apart on $y = x^2$. As you slide $h$ toward zero, the secant tilts toward the tangent (teal), and its slope $\frac{(x+h)^2 - x^2}{h}$ approaches the true derivative $2x = 2$ at $x = 1$.
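The shrinking-$h$ idea can also be tried numerically. This is a small sketch, assuming $f(x) = x^2$ and $x = 1$ as in the figure; the function name `secant_slope` and the particular values of $h$ are illustrative.

```python
def f(x):
    return x ** 2


def secant_slope(f, x, h):
    """Difference quotient (f(x + h) - f(x)) / h from the limit definition."""
    return (f(x + h) - f(x)) / h


x = 1.0
for h in [1.5, 0.5, 0.1, 0.01, 0.001]:
    # For x^2 the secant slope works out to 2x + h, so it closes in on 2.
    print(f"h = {h:<6} secant slope = {secant_slope(f, x, h):.4f}")
```

The printed slopes step down toward 2 as $h$ shrinks, which is the tangent slope at $x = 1$.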

05 Derivative of x² from scratch

Let's actually compute it. We have $f(x) = x^2$. Apply the limit definition:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}$$

Expand the square: $(x+h)^2 = x^2 + 2xh + h^2$. So:

$$f'(x) = \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h}$$

The $x^2$ terms cancel. Pull out a factor of $h$:

$$= \lim_{h \to 0} \frac{h(2x + h)}{h} = \lim_{h \to 0} (2x + h)$$

And as $h \to 0$, the leftover $h$ vanishes:

$$f'(x) = 2x$$

That's it. The derivative of $x^2$ is $2x$. At $x = 1$ the slope is 2, at $x = -3$ the slope is $-6$, at $x = 0$ the slope is 0 — exactly the numbers Fig. 2 was showing.

What happened

The trick is always the same: write down the difference quotient $[f(x+h) - f(x)]/h$, do enough algebra that $h$ cancels out of the denominator, then set the remaining $h$ to zero. A function for which this limit exists is called differentiable, and for the "nice" functions used in this tutorial it always does.
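The same recipe carries over to higher powers. As a sketch, here it is for $f(x) = x^3$, using the expansion $(x+h)^3 = x^3 + 3x^2h + 3xh^2 + h^3$:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^3 - x^3}{h} = \lim_{h \to 0} \frac{3x^2h + 3xh^2 + h^3}{h} = \lim_{h \to 0} (3x^2 + 3xh + h^2) = 3x^2$$

Every leftover term carries at least one factor of $h$, so only $3x^2$ survives the limit.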

06 The power rule

If you did §5 again but for $f(x) = x^3$, you'd get $f'(x) = 3x^2$. For $x^4$, you'd get $4x^3$. The pattern is:

$$\frac{d}{dx} x^n \;=\; n\, x^{n-1}$$

Pull the exponent down to the front, subtract 1 from the exponent. Done. This works for any real exponent (positive, negative, fractional) wherever $x^n$ itself is defined; for example, $\sqrt{x}$ needs $x > 0$ and $1/x$ needs $x \neq 0$. A few quick examples:

| Function | Derivative | How |
|---|---|---|
| $x^5$ | $5x^4$ | $n = 5$, so $5 x^{5-1}$ |
| $x^{10}$ | $10x^9$ | $n = 10$ |
| $x = x^1$ | $1$ | $n = 1$, so $1 \cdot x^0 = 1$ |
| $1 = x^0$ | $0$ | The derivative of a constant is zero |
| $\sqrt{x} = x^{1/2}$ | $\tfrac{1}{2}x^{-1/2}$ | $n = 1/2$ |
| $1/x = x^{-1}$ | $-x^{-2}$ | $n = -1$ |

"The derivative of a constant is zero" is the most useful one to internalize. A constant doesn't change, so its rate of change is zero. The line $y = 7$ is flat; the slope is zero everywhere.

07 Sums and constant multiples

Two more rules, each one a paragraph long:

Constant multiple rule

$$\frac{d}{dx}\bigl(c \cdot f(x)\bigr) = c \cdot f'(x)$$

A constant out front comes along for the ride. The derivative of $5x^2$ is $5 \cdot 2x = 10x$.

Sum rule

$$\frac{d}{dx}\bigl(f(x) + g(x)\bigr) = f'(x) + g'(x)$$

The derivative of a sum is the sum of the derivatives — you can differentiate each piece separately and add. So:

$$\frac{d}{dx}(3x^2 + 5x - 7) = 6x + 5 - 0 = 6x + 5$$

Why the sum rule? Because rates of change add. If your bank account is growing by \$10/day from a job and \$3/day from interest, it's growing by \$13/day total. The rate of change of the total is the sum of the rates.

What we have so far

With just the power rule, sum rule, and constant multiple rule, you can differentiate any polynomial. That covers a huge fraction of the simple cases. The product rule and quotient rule exist for when functions multiply or divide each other, but we won't need them for the foundations of backpropagation — those use the chain rule (next tutorial) instead.
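The three rules combine into a very short procedure. The sketch below assumes a polynomial is stored as a list of coefficients `[c0, c1, c2, ...]` meaning $c_0 + c_1 x + c_2 x^2 + \ldots$; that representation and the name `poly_derivative` are illustrative choices.

```python
def poly_derivative(coeffs):
    """Differentiate c0 + c1*x + c2*x^2 + ... given as [c0, c1, c2, ...].

    Power rule: x^n becomes n * x^(n-1).
    Constant multiple rule: each coefficient rides along.
    Sum rule: each term is handled separately.
    """
    # The constant term (n = 0) differentiates to zero and is dropped.
    derivative = [n * c for n, c in enumerate(coeffs)][1:]
    return derivative or [0]


# 3x^2 + 5x - 7 is stored lowest power first as [-7, 5, 3].
print(poly_derivative([-7, 5, 3]))  # [5, 6], meaning 5 + 6x
```

The result matches the $6x + 5$ computed by hand above.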

08 The exponential eˣ

The most magical function in calculus is $f(x) = e^x$, where $e \approx 2.71828\ldots$ is a specific irrational number. The magic:

$$\frac{d}{dx} e^x = e^x$$

It is its own derivative. The slope of the curve $y = e^x$ at any point equals the height of the curve at that point. There is no other function (up to a scalar multiple) that has this property — it's what defines $e$ in the first place.

Fig. 4. $e^x$ (teal) and its derivative (amber). They overlap perfectly — that's the defining property. Drag to confirm at any point, the slope equals the height.

Why care, for backprop? Because the sigmoid function — the activation we use in §9 of the backprop tutorial — is built out of $e^x$. Knowing the derivative of $e^x$ is what lets us compute the derivative of the sigmoid.

A useful chain-rule preview

The derivative of $e^{kx}$ (for constant $k$) is $k\,e^{kx}$. That extra factor of $k$ comes from the chain rule, which we cover next. For now: $\frac{d}{dx}e^{-x} = -e^{-x}$, $\frac{d}{dx}e^{2x} = 2e^{2x}$.
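Both claims can be checked numerically with the difference quotient from §4. This is a rough sketch using a small fixed $h$ rather than a true limit, so the numbers agree only to several decimal places; the helper name `diff_quotient` and the sample points are illustrative.

```python
import math


def diff_quotient(f, x, h=1e-6):
    """Approximate f'(x) by the difference quotient with a small fixed h."""
    return (f(x + h) - f(x)) / h


# Slope of e^x versus its height at a few sample points.
for x in [-1.0, 0.0, 2.0]:
    slope = diff_quotient(math.exp, x)
    print(f"x = {x:4.1f}: slope approx {slope:.5f}, height e^x = {math.exp(x):.5f}")

# d/dx e^(kx) = k e^(kx), checked for k = 2 at x = 0.5.
k, x0 = 2.0, 0.5
approx = diff_quotient(lambda t: math.exp(k * t), x0)
exact = k * math.exp(k * x0)
print(f"e^(2x) at x = 0.5: slope approx {approx:.5f}, k e^(kx) = {exact:.5f}")
```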

09 The sigmoid σ(x)

The sigmoid function squashes any real number into the open interval $(0, 1)$:

$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}$$

Big positive input → output near 1. Big negative input → output near 0. Input zero → output exactly $1/2$. It is the original "soft on/off" of neural networks.

Fig. 5. The sigmoid, with its derivative drawn underneath. Drag the dot; notice the derivative peaks at $x = 0$ where the sigmoid is steepest, and is near zero far from the center where the sigmoid is flat.

The next section computes the derivative of the sigmoid. It turns out to be expressible in terms of the sigmoid itself, and that is what makes the backward pass through a sigmoid neuron cheap to compute.

10 σ′(x) — the magic identity

We claim, and now we will prove, that:

$$\sigma'(x) \;=\; \sigma(x)\cdot \bigl(1 - \sigma(x)\bigr)$$

Read that aloud: the derivative of the sigmoid is the sigmoid times one minus the sigmoid. Because we have $\sigma(x)$ from the forward pass already (it's the neuron's output), we don't need to recompute anything — we just multiply.

Proof, line by line

Start with the definition, written with a negative exponent so the chain rule is easier to apply:

$$\sigma(x) = (1 + e^{-x})^{-1}$$

The chain rule (covered next tutorial; you can take it on faith for now) gives:

$$\sigma'(x) = -1 \cdot (1 + e^{-x})^{-2} \cdot \frac{d}{dx}(1 + e^{-x}) = -(1 + e^{-x})^{-2} \cdot (-e^{-x})$$

The two minus signs cancel:

$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$$

Now rewrite the numerator as $(1 + e^{-x}) - 1$:

$$\sigma'(x) = \frac{(1+e^{-x}) - 1}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} - \frac{1}{(1+e^{-x})^2}$$

Each piece is $\sigma$ or $\sigma^2$:

$$\sigma'(x) = \sigma(x) - \sigma(x)^2 = \sigma(x)\bigl(1 - \sigma(x)\bigr) \quad\checkmark$$

Why this matters for backprop

In backpropagation, every time the gradient flows backward through a sigmoid neuron, we multiply by $\sigma'(x) = y(1-y)$ where $y$ is the neuron's output. Because we already have $y$ from the forward pass, this costs one subtraction and one multiplication. No re-evaluating exponentials. For sigmoid layers, the backward pass becomes a cascade of cheap multiplications because of this one identity.
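As a sanity check on the identity, the sketch below compares $y(1-y)$ against a difference quotient of the sigmoid. The function names and the sample inputs are illustrative, and the fixed $h$ means agreement is approximate.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime_from_output(y):
    """Derivative of the sigmoid written in terms of its own output y."""
    return y * (1.0 - y)


h = 1e-6
for x in [-4.0, 0.0, 1.5]:
    y = sigmoid(x)                               # forward pass
    identity = sigmoid_prime_from_output(y)      # one multiply, no new exp
    numeric = (sigmoid(x + h) - sigmoid(x)) / h  # difference quotient
    print(f"x = {x:4.1f}: y(1-y) = {identity:.6f}, difference quotient = {numeric:.6f}")
```

At $x = 0$ the output is $1/2$, so the derivative is $1/4$, the peak of the lower curve in Fig. 5.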

11 A small table to memorize

These are the derivatives that show up in every neural network tutorial. Worth knowing cold.

| Function | Derivative | Why |
|---|---|---|
| $c$ (constant) | $0$ | doesn't change |
| $x$ | $1$ | slope of $y=x$ is 1 |
| $x^n$ | $n\,x^{n-1}$ | power rule |
| $e^x$ | $e^x$ | self-similar |
| $e^{kx}$ | $k\,e^{kx}$ | chain rule |
| $\ln(x)$ | $1/x$ | inverse of $e^x$ |
| $\sin(x)$ | $\cos(x)$ | |
| $\cos(x)$ | $-\sin(x)$ | |
| $\sigma(x)$ | $\sigma(x)(1-\sigma(x))$ | see §10 |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | same trick |
| $\max(0,x)$ (ReLU) | $1$ if $x > 0$, else $0$ | piecewise |

Combined with the sum rule, the constant multiple rule, and (in the next tutorial) the chain rule, this table is enough to derive nearly every derivative you'll meet in a basic neural network.
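If you would rather test the table than take it on trust, a few rows can be spot-checked with a difference quotient. This is only a sketch: the test points are arbitrary, the fixed $h$ gives approximate agreement, and the helper names are illustrative.

```python
import math


def diff_quotient(f, x, h=1e-6):
    """Difference quotient from the limit definition, with a small fixed h."""
    return (f(x + h) - f(x)) / h


def relu(x):
    return max(0.0, x)


# (name, function, derivative claimed by the table, arbitrary test point)
rows = [
    ("ln(x)",   math.log,  lambda x: 1.0 / x,                 2.0),
    ("sin(x)",  math.sin,  math.cos,                          0.7),
    ("cos(x)",  math.cos,  lambda x: -math.sin(x),            0.7),
    ("tanh(x)", math.tanh, lambda x: 1.0 - math.tanh(x) ** 2, 0.3),
    ("ReLU(x)", relu,      lambda x: 1.0 if x > 0 else 0.0,   -1.2),
]

errors = {}
for name, f, f_prime, x in rows:
    errors[name] = abs(diff_quotient(f, x) - f_prime(x))
    print(f"{name:8s} x = {x:5.1f}  table = {f_prime(x):8.5f}  error = {errors[name]:.1e}")
```

Each error should be tiny compared with the derivative itself; a large one would signal a wrong table entry.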

12 Functions of many variables

Up to now, $f$ has taken one number and returned one number. Neural networks don't work that way. A loss function takes all the weights — sometimes millions of them — and returns one number (how wrong the network is). So we need to handle functions like:

$$f(x, y) = x^2 + 3xy + y^2$$

This is a function of two variables. You can picture it as a surface above the $(x, y)$ plane, taller where $f$ is bigger, like a landscape of hills and valleys. (This particular $f$ is a saddle rather than a bowl: it rises along the line $y = x$ and falls along $y = -x$.) Neural-network loss functions are very high-dimensional landscapes; the picture in your head stops working past three dimensions, but the math doesn't change.

The question is: what's the analog of a "derivative" when there are multiple input variables?

13 Partial derivatives

The answer: instead of one derivative, there are now several. One for each input variable. They're called partial derivatives, and the notation switches from $d/dx$ to $\partial/\partial x$ to signal that "$x$ is one of many" rather than "$x$ is the only variable."

$$\frac{\partial f}{\partial x}: \text{the slope if I wiggle just } x, \text{ holding all other variables fixed}$$

Mechanically, computing a partial derivative is exactly like computing an ordinary derivative — you just pretend the other variables are constants. Watch:

Example: f(x, y) = x² + 3xy + y²

To compute $\partial f / \partial x$, treat $y$ as if it were the number 7 (or any other constant). Then:

  • $x^2$ differentiates to $2x$ (power rule).
  • $3xy$ — with $y$ a constant — differentiates to $3y$ (constant out front).
  • $y^2$, a pure constant, differentiates to $0$.

$$\frac{\partial f}{\partial x} = 2x + 3y$$

For $\partial f / \partial y$, switch roles: $x$ is now the constant.

  • $x^2$ → $0$.
  • $3xy$ → $3x$.
  • $y^2$ → $2y$.

$$\frac{\partial f}{\partial y} = 3x + 2y$$

Reading partial derivatives

$\partial f / \partial x = 2x + 3y$ means: if I'm at the point $(x, y) = (1, 4)$, and I wiggle $x$ a tiny bit, $f$ changes at a rate of $2(1) + 3(4) = 14$ per unit of wiggling. Wiggle $y$ instead and the rate is $3(1) + 2(4) = 11$. Different variables, different sensitivities.
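"Wiggle one variable, hold the other fixed" translates directly into code. This sketch approximates both partial derivatives of the example function at $(1, 4)$ with a small fixed $h$; the helper names `partial_x` and `partial_y` are illustrative.

```python
def f(x, y):
    return x**2 + 3*x*y + y**2


def partial_x(f, x, y, h=1e-6):
    """Wiggle x only; y is held fixed."""
    return (f(x + h, y) - f(x, y)) / h


def partial_y(f, x, y, h=1e-6):
    """Wiggle y only; x is held fixed."""
    return (f(x, y + h) - f(x, y)) / h


print(round(partial_x(f, 1.0, 4.0), 3))  # 14.0, matching 2(1) + 3(4)
print(round(partial_y(f, 1.0, 4.0), 3))  # 11.0, matching 3(1) + 2(4)
```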

14 The gradient vector

When a function has $n$ inputs, it has $n$ partial derivatives. We bundle them into a vector called the gradient:

$$\nabla f \;=\; \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$$

For our example $f(x, y) = x^2 + 3xy + y^2$, the gradient is:

$$\nabla f = (2x + 3y,\ 3x + 2y)$$

This is a vector that lives at each point of the input space. It has two related geometric meanings, both useful:

Direction
$\nabla f$ at a point $(x, y)$ points in the direction in which $f$ increases fastest. To go uphill the most steeply, walk along the gradient. To go downhill — which is what gradient descent and backpropagation want — walk along the negative of the gradient.
Magnitude
The length of $\nabla f$ tells you how steep the climb is. Where $\nabla f$ is long, the function changes rapidly; where it's short (near a minimum or maximum), the function is nearly flat.

The whole purpose of backpropagation is to compute the gradient of a loss function with respect to the weights of a neural network. Once you have the gradient, you take a step opposite to it ("downhill"), which lowers the loss a little. Repeat thousands of times. That's training.
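Here is a minimal sketch of "step opposite to the gradient." The first part takes one small step on the example $f$ from $(1, 4)$. Because that $f$ takes negative values of any size along $y = -x$, it has no minimum to converge to, so the repeated loop uses a hypothetical bowl-shaped function instead. The step sizes, starting points, and the `bowl` function are all assumptions made for illustration.

```python
def f(x, y):
    return x**2 + 3*x*y + y**2


def grad_f(x, y):
    # The gradient computed by hand above.
    return (2*x + 3*y, 3*x + 2*y)


# One small step against the gradient, starting from (1, 4).
x0, y0 = 1.0, 4.0
step_size = 0.01  # hypothetical step size
gx, gy = grad_f(x0, y0)
x1, y1 = x0 - step_size * gx, y0 - step_size * gy
print(f(x0, y0), "->", f(x1, y1))  # the value goes down a little


# Repeating the step on a hypothetical bowl with its minimum at (1, -2).
def bowl(x, y):
    return (x - 1)**2 + (y + 2)**2


def grad_bowl(x, y):
    return (2 * (x - 1), 2 * (y + 2))


bx, by = 5.0, 5.0
for _ in range(200):
    gbx, gby = grad_bowl(bx, by)
    bx, by = bx - 0.1 * gbx, by - 0.1 * gby
print(round(bx, 4), round(by, 4))  # ends up very close to (1, -2)
```

In a real network the gradient comes from backpropagation rather than a hand-written formula, but the update step has the same shape.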

A gradient is just a vector of partial derivatives; there is no extra concept.

15 Common confusions

What's the difference between $d$ and $\partial$?
Pure notation. $df/dx$ is used when $f$ has only one input variable; $\partial f / \partial x$ is used when $f$ has several and you're holding all others fixed. The arithmetic is identical.
Why do we need limits and "infinitely small" stuff?
Because the slope of a tangent line is the slope of a line through two coinciding points — and "rise over run" with zero run is $0/0$, which is meaningless. Limits formalize the idea of "let the two points come together while keeping the ratio sensible." For every smooth function in machine learning, the limit turns out to give a perfectly well-defined number, which is the derivative.
A function can have no derivative at some points — when?
When the curve has a sharp corner, the tangent isn't well-defined: the slope coming in from the left is different from the slope going out to the right. ReLU has this problem at $x = 0$; in practice ML frameworks pick a convention (usually call it 0 or 1) and move on. Smoothly-curving functions like the sigmoid have derivatives everywhere.
Why is the derivative "rate of change" — what's changing?
When we talk about $df/dx$, the implicit picture is: $x$ is increasing steadily (think of $x$ as time), and $f$ is changing as $x$ changes. The derivative tells you how fast $f$ is changing per unit of $x$. In machine learning, $x$ is usually a weight in the network, and $f$ is the loss; the derivative tells you how the loss responds when you nudge that weight.
Are gradients always vectors?
When the function returns a single number (a scalar) and takes a vector of inputs, like a loss function, the gradient is a vector with one entry per input. If the function returned multiple numbers (a vector), the natural analog is a matrix called the Jacobian. In backpropagation the loss is a scalar, so its gradient with respect to the weights is a vector with one entry per weight.
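The sharp-corner answer above can be seen numerically. This sketch computes the slope of ReLU at $x = 0$ from each side; the value of $h$ is an arbitrary small number.

```python
def relu(x):
    return max(0.0, x)


h = 1e-6
right = (relu(0.0 + h) - relu(0.0)) / h  # slope approaching from the right
left = (relu(0.0) - relu(0.0 - h)) / h   # slope approaching from the left
print(left, right)  # 0.0 1.0
```

The two one-sided slopes disagree, so the limit in the definition does not exist at $x = 0$, which is why a convention is needed there.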

16 Where to go next

You now have most of the calculus you need for backpropagation. There is one missing ingredient: how to differentiate compositions of functions, like $\sigma(W \cdot x + b)$. That's the chain rule, and it has its own tutorial.

Once you have the chain rule, you'll be ready for the third and final piece — backpropagation itself, which is nothing more than the chain rule applied to the layers of a neural network.
