Derivatives, taught slowly
A derivative is the slope of a tangent line — and that is enough to understand every piece of calculus that appears in machine learning. We build up from straight lines.
01 Why derivatives?
Most learning algorithms in machine learning work the same way: there is a number that says "how wrong the model is," and the algorithm nudges the model's parameters to make that number smaller. The question, for every parameter, in every model, on every step, is: if I wiggle this parameter a tiny bit, does the wrongness go up or down, and by how much?
That question has a name. The answer is the derivative. Once you know the derivative of "wrongness" with respect to a parameter, you know which direction to nudge it. Calculus calls this the slope of the tangent line; machine learning calls it the gradient; physicists call it the rate of change. They are all the same idea, and they are all the building block of backpropagation.
By the end of this page, you'll be able to (a) compute the derivative of any sum of power functions, (b) compute the derivative of the exponential and the sigmoid by hand, (c) read partial derivatives without flinching, and (d) understand what a gradient vector is. Apart from the chain rule, which has its own tutorial, that is all the calculus backpropagation requires.
02 Slope of a straight line
Start with the easiest thing in mathematics: a straight line. If a line passes through two points $(x_1, y_1)$ and $(x_2, y_2)$, its slope is defined as rise over run:
$$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{y_2 - y_1}{x_2 - x_1}$$
The slope tells you: if I increase $x$ by 1, $y$ changes by this much. A slope of 2 means $y$ rises by 2 for every 1 we add to $x$. A slope of $-1$ means $y$ falls by 1.
The crucial property of a line is that the slope is the same everywhere on the line. It doesn't matter which two points you pick — same slope. That's what makes it a line.
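The "same slope everywhere" property is easy to check numerically. The following is a minimal sketch; the helper names `slope` and `line`, and the particular line $y = 2x + 1$, are illustrative choices rather than anything defined in this tutorial.

```python
def slope(p1, p2):
    """Rise over run between two points (x1, y1) and (x2, y2)."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    return (y2 - y1) / (x2 - x1)


def line(x):
    """An illustrative straight line, y = 2x + 1 (assumed for this example)."""
    return 2 * x + 1


# Two different pairs of points on the same line give the same slope.
print(slope((0, line(0)), (1, line(1))))       # 2.0
print(slope((-3, line(-3)), (10, line(10))))   # 2.0
```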
03 Tangent to a curve
Now bend the line into a curve. A parabola, say — $y = x^2$. There is no longer a single slope, because the curve is steeper in some places than others. Near $x = 0$ it's almost flat; far from zero it's steep.
To recover the idea of a slope, we ask a more careful question: what is the slope at a specific point? Imagine zooming in on the curve right at one point. As you zoom in further and further, the curve looks more and more like a straight line. The slope of that straight line — the line that just kisses the curve at that one point — is the tangent.
As a point moves along the curve, the tangent line at that point moves with it. The slope of the tangent at a point is the derivative at that point.
So a derivative is a function: given an input $x$, it returns the slope of the curve at that point. For $y = x^2$, we'll discover in §5 that the derivative is $2x$. At $x = 1$, the slope is 2. At $x = -3$, the slope is $-6$. At $x = 0$, the slope is 0 — the curve is flat there, and that's the bottom of the bowl.
The derivative of $f(x)$ is written several ways, all meaning the same thing:
$$f'(x) = \frac{df}{dx} = \frac{d}{dx} f(x)$$
$f'(x)$ is the most compact. $df/dx$ reads as "the change in $f$ per change in $x$" and is helpful when you want to track which variable you're differentiating with respect to.
04 The limit definition
"The slope of the line you get by zooming in" is a good intuition, but we need a formula. Here it is. Take two nearby points on the curve: $(x, f(x))$ and $(x+h, f(x+h))$, where $h$ is a tiny number. The slope of the line connecting them, called the secant line, is:
$$\frac{f(x+h) - f(x)}{h}$$
Now shrink $h$ toward zero. The two points come closer together; the secant line tilts toward the tangent. Once $h$ is "infinitely small," the secant has become the tangent, and we have the slope at $x$:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
That symbol $\lim_{h \to 0}$ — "the limit as $h$ goes to zero" — is the formal way of saying "shrink $h$ down to nothing." Mathematicians spent the entire 19th century making "infinitely small" rigorous; we can get the whole benefit by trusting their work and just shrinking $h$ in our heads.
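Shrinking $h$ can also be tried on a computer. The sketch below evaluates the secant slope of $y = x^2$ at $x = 1$ for smaller and smaller $h$; the function names `difference_quotient` and `square` are made up for this example, and the printed values should drift toward 2 (floating-point rounding means they will not be exact).

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# As h shrinks, the secant slope at x = 1 approaches the tangent slope.
for h in (1.0, 0.1, 0.01, 0.001, 0.0001):
    print(h, difference_quotient(square, 1.0, h))
```

In practice $h$ cannot be made arbitrarily small on a computer: below roughly $10^{-8}$ the subtraction $f(x+h) - f(x)$ loses too many digits to rounding.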
05 Derivative of x² from scratch
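Let's actually compute it. We have $f(x) = x^2$. Apply the limit definition:
$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h}$$
Expand the square: $(x+h)^2 = x^2 + 2xh + h^2$. So:
$$f'(x) = \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h}$$
The $x^2$ terms cancel. Pull out a factor of $h$:
$$f'(x) = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} \frac{h\,(2x + h)}{h} = \lim_{h \to 0} \,(2x + h)$$
And as $h \to 0$, the leftover $h$ vanishes:
$$f'(x) = 2x$$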
That's it. The derivative of $x^2$ is $2x$. At $x = 1$ the slope is 2, at $x = -3$ the slope is $-6$, at $x = 0$ the slope is 0 — exactly the numbers Fig. 2 was showing.
The trick is always the same: write down the difference quotient $[f(x+h) - f(x)]/h$, do enough algebra that $h$ cancels out of the denominator, then set the remaining $h$ to zero. Functions for which this limit exists are called differentiable, and that is the precise meaning of "nice" here.
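The cancellation can be checked with exact arithmetic. For $f(x) = x^2$ the difference quotient simplifies to $2x + h$ for every nonzero $h$, so with Python's `fractions.Fraction` (which avoids floating-point rounding) the two sides should agree exactly. This is a small sketch; `secant_slope_of_square` is a name invented for the example.

```python
from fractions import Fraction


def secant_slope_of_square(x, h):
    """Difference quotient of f(x) = x^2, computed without simplifying."""
    return ((x + h) ** 2 - x ** 2) / h


x = Fraction(-3)
for h in (Fraction(1, 10), Fraction(1, 1000), Fraction(1, 10**6)):
    s = secant_slope_of_square(x, h)
    # After the algebra, the quotient is exactly 2x + h.
    assert s == 2 * x + h
    print(h, s)
```

Each printed slope differs from $2x = -6$ by exactly $h$, which is the "leftover $h$" that vanishes in the limit.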
06 The power rule
If you did §5 again but for $f(x) = x^3$, you'd get $f'(x) = 3x^2$. For $x^4$, you'd get $4x^3$. The pattern is:
$$\frac{d}{dx}\, x^n = n\,x^{n-1}$$
Pull the exponent down to the front, subtract 1 from the exponent. Done. This works for any real exponent — positive, negative, fractional. A few quick examples:
| Function | Derivative | How |
|---|---|---|
| $x^5$ | $5x^4$ | $n = 5$, so $5 x^{5-1}$ |
| $x^{10}$ | $10x^9$ | $n = 10$ |
| $x = x^1$ | $1$ | $n = 1$, so $1 \cdot x^0 = 1$ |
| $1 = x^0$ | $0$ | The derivative of a constant is zero |
| $\sqrt{x} = x^{1/2}$ | $\tfrac{1}{2}x^{-1/2}$ | $n = 1/2$ |
| $1/x = x^{-1}$ | $-x^{-2}$ | $n = -1$ |
"The derivative of a constant is zero" is the most useful one to internalize. A constant doesn't change, so its rate of change is zero. The line $y = 7$ is flat; the slope is zero everywhere.
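The rows of the table can be spot-checked numerically. The sketch below compares the power rule with a central-difference estimate, $[f(x+h) - f(x-h)]/(2h)$, which is a swapped-in approximation technique (not the one-sided limit definition from §4) chosen because it is more accurate for a given $h$. The helper names and the test point $x = 2$ are assumptions made for the example.

```python
def power_rule(n, x):
    """Derivative of x**n at x, by the power rule: n * x**(n - 1)."""
    return n * x ** (n - 1)


def central_difference(f, x, h=1e-6):
    """Numerical slope estimate; an approximation, not an exact derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 2.0
for n in (5, 10, 1, 0.5, -1):
    exact = power_rule(n, x)
    approx = central_difference(lambda t: t ** n, x)
    print(n, exact, approx)
```

The two columns should agree to several decimal places for each exponent, including the fractional and negative ones.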
07 Sums and constant multiples
Two more rules, each one a paragraph long:
Constant multiple rule
$$\frac{d}{dx}\big[c\,f(x)\big] = c\,f'(x)$$
A constant out front comes along for the ride. The derivative of $5x^2$ is $5 \cdot 2x = 10x$.
Sum rule
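$$\frac{d}{dx}\big[f(x) + g(x)\big] = f'(x) + g'(x)$$
The derivative of a sum is the sum of the derivatives: you can differentiate each piece separately and add. So, for example:
$$\frac{d}{dx}\big(x^3 + 5x^2 + 7\big) = 3x^2 + 10x + 0 = 3x^2 + 10x$$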
Why the sum rule? Because rates of change add. If your bank account is growing by \$10/day from a job and \$3/day from interest, it's growing by \$13/day total. The rate of change of the total is the sum of the rates.
With just the power rule, sum rule, and constant multiple rule, you can differentiate any polynomial. That covers a huge fraction of the simple cases. The product rule and quotient rule exist for when functions multiply or divide each other, but we won't need them for the foundations of backpropagation — those use the chain rule (next tutorial) instead.
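Those three rules are mechanical enough to write as a few lines of code. In this sketch a polynomial is stored as a list of coefficients, with `coeffs[k]` multiplying $x^k$; that representation and the function names are choices made for the example, not something this tutorial prescribes.

```python
def differentiate(coeffs):
    """Differentiate a polynomial given as [c0, c1, c2, ...] for c0 + c1*x + c2*x^2 + ...

    Power rule and constant multiple rule: c*x^k becomes (k*c)*x^(k-1).
    Sum rule: each term is handled separately. The constant term drops out.
    """
    return [k * c for k, c in enumerate(coeffs)][1:]


def evaluate(coeffs, x):
    """Value of the polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


p = [7, 0, 5, 1]          # 7 + 5x^2 + x^3
dp = differentiate(p)     # [0, 10, 3], i.e. 10x + 3x^2
print(dp)
print(evaluate(dp, 2))    # slope of p at x = 2: 10*2 + 3*4 = 32
```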
08 The exponential eˣ
The most magical function in calculus is $f(x) = e^x$, where $e \approx 2.71828\ldots$ is a specific irrational number. The magic:
$$\frac{d}{dx}\, e^x = e^x$$
It is its own derivative. The slope of the curve $y = e^x$ at any point equals the height of the curve at that point. There is no other function (up to a scalar multiple) that has this property — it's what defines $e$ in the first place.
Why care, for backprop? Because the sigmoid function — the activation we use in §9 of the backprop tutorial — is built out of $e^x$. Knowing the derivative of $e^x$ is what lets us compute the derivative of the sigmoid.
The derivative of $e^{kx}$ (for constant $k$) is $k\,e^{kx}$. That extra factor of $k$ comes from the chain rule, which we cover next. For now: $\frac{d}{dx}e^{-x} = -e^{-x}$, $\frac{d}{dx}e^{2x} = 2e^{2x}$.
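Both claims can be checked numerically. The sketch below uses a central-difference estimate (an approximation introduced for this example, with an assumed step of $10^{-6}$) and compares it with the height of the curve for $e^x$, and with $k\,e^{kx}$ for $k = -1$ and $k = 2$.

```python
import math


def central_difference(f, x, h=1e-6):
    """Numerical slope estimate; an approximation, not an exact derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Slope of e^x versus its height at a few points.
for x in (-1.0, 0.0, 2.0):
    print(x, math.exp(x), central_difference(math.exp, x))

# Slope of e^(kx) versus k * e^(kx), at the arbitrary point x = 0.5.
for k in (-1.0, 2.0):
    estimate = central_difference(lambda t: math.exp(k * t), 0.5)
    print(k, k * math.exp(k * 0.5), estimate)
```

In every row the last two numbers should match to several decimal places.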
09 The sigmoid σ(x)
The sigmoid function squashes any real number into the open interval $(0, 1)$:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Big positive input → output near 1. Big negative input → output near 0. Input zero → output exactly $1/2$. It is the original "soft on/off" of neural networks.
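A direct implementation, using the standard definition $\sigma(x) = 1/(1 + e^{-x})$, looks like the sketch below. The branch for negative inputs is a design choice added here, not part of the definition: it computes the algebraically equal form $e^x / (1 + e^x)$ so that `math.exp` is never called with a large positive argument, which would raise `OverflowError`.

```python
import math


def sigmoid(x):
    """Sigmoid of a real number, written to avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # safe: x < 0, so z is between 0 and 1
    return z / (1.0 + z)


print(sigmoid(0.0))      # 0.5
print(sigmoid(20.0))     # very close to 1
print(sigmoid(-20.0))    # very close to 0
```

Mathematically the output never reaches 0 or 1, but in floating point very large inputs round to exactly those values.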
The next section computes the derivative of the sigmoid. It is, miraculously, expressible in terms of the sigmoid itself — and that miracle is what makes backpropagation cheap to compute.
10 σ′(x) — the magic identity
We claim, and now we will prove, that:
$$\sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big)$$
Read that aloud: the derivative of the sigmoid is the sigmoid times one minus the sigmoid. Because we have $\sigma(x)$ from the forward pass already (it's the neuron's output), we don't need to recompute anything — we just multiply.
Proof, line by line
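Start with the definition, written as a negative exponent so the chain rule is easier:
$$\sigma(x) = \big(1 + e^{-x}\big)^{-1}$$
The chain rule (covered next tutorial; you can take it on faith for now) gives:
$$\sigma'(x) = -\big(1 + e^{-x}\big)^{-2} \cdot \big({-e^{-x}}\big)$$
The two minus signs cancel:
$$\sigma'(x) = \frac{e^{-x}}{\big(1 + e^{-x}\big)^{2}}$$
Now rewrite the numerator as $(1 + e^{-x}) - 1$:
$$\sigma'(x) = \frac{(1 + e^{-x}) - 1}{\big(1 + e^{-x}\big)^{2}} = \frac{1}{1 + e^{-x}} - \frac{1}{\big(1 + e^{-x}\big)^{2}}$$
Each piece is $\sigma$ or $\sigma^2$:
$$\sigma'(x) = \sigma(x) - \sigma(x)^2 = \sigma(x)\,\big(1 - \sigma(x)\big)$$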
In backpropagation, every time the gradient flows backward through a sigmoid neuron, we multiply by $\sigma'(x) = y(1-y)$ where $y$ is the neuron's output. Because we already have $y$ from the forward pass, this costs one subtraction and one multiplication, with no re-evaluation of exponentials. For a network of sigmoid neurons, this identity is what keeps the backward pass through each activation cheap.
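The identity can be sanity-checked numerically. In this sketch, `sigmoid_prime_from_output` (a name invented here) takes only the forward-pass output $y$ and never touches an exponential; the result is compared with a central-difference estimate of the slope, which is an approximation used only for checking.

```python
import math


def sigmoid(x):
    """Plain definition; fine for the moderate inputs used below."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime_from_output(y):
    """Derivative of the sigmoid, given its output y = sigmoid(x)."""
    return y * (1.0 - y)


def central_difference(f, x, h=1e-6):
    """Numerical slope estimate; an approximation, not an exact derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 3.0):
    y = sigmoid(x)  # already available from the forward pass
    print(x, sigmoid_prime_from_output(y), central_difference(sigmoid, x))
```

At $x = 0$ the output is $1/2$, so the slope is $\tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}$, the largest value the sigmoid's derivative takes.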
11 A small table to memorize
These are the derivatives that show up in every neural network tutorial. Worth knowing cold.
| Function | Derivative | Why |
|---|---|---|
| $c$ (constant) | $0$ | doesn't change |
| $x$ | $1$ | slope of $y=x$ is 1 |
| $x^n$ | $n\,x^{n-1}$ | power rule |
| $e^x$ | $e^x$ | self-similar |
| $e^{kx}$ | $k\,e^{kx}$ | chain rule |
| $\ln(x)$ | $1/x$ | inverse of $e^x$ |
| $\sin(x)$ | $\cos(x)$ | — |
| $\cos(x)$ | $-\sin(x)$ | — |
| $\sigma(x)$ | $\sigma(x)(1-\sigma(x))$ | see §10 |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | same trick |
| $\max(0,x)$ (ReLU) | $1$ if $x > 0$, else $0$ | piecewise; strictly undefined at $x = 0$, where $0$ is a convention |
Combined with the sum rule, the constant multiple rule, and (in the next tutorial) the chain rule, this table is enough to derive every derivative you'll see in a neural network.
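Several rows of the table can be verified in a few lines. The sketch below pairs a function with its claimed derivative and compares the claim with a central-difference estimate at an arbitrary test point; the helper names, the test points, and the step size are all assumptions made for this example. ReLU is only checked away from $x = 0$, where its slope is well defined.

```python
import math


def central_difference(f, x, h=1e-6):
    """Numerical slope estimate; an approximation, not an exact derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


def relu(x):
    return max(0.0, x)


def relu_prime(x):
    return 1.0 if x > 0 else 0.0


# (name, function, claimed derivative, test point)
checks = [
    ("ln", math.log, lambda x: 1.0 / x, 2.0),
    ("sin", math.sin, math.cos, 0.7),
    ("cos", math.cos, lambda x: -math.sin(x), 0.7),
    ("tanh", math.tanh, lambda x: 1.0 - math.tanh(x) ** 2, 0.3),
    ("relu", relu, relu_prime, 1.5),
    ("relu", relu, relu_prime, -1.5),
]

for name, f, f_prime, x in checks:
    error = abs(central_difference(f, x) - f_prime(x))
    print(name, x, error)
```

Every printed error should be tiny compared with the derivative values themselves.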
12 Functions of many variables
Up to now, $f$ has taken one number and returned one number. Neural networks don't work that way. A loss function takes all the weights, sometimes millions of them, and returns one number (how wrong the network is). So we need to handle functions like, for example:
$$f(x, y) = x^2 + y^2$$
This is a function of two variables. You can picture it as a surface above the $(x, y)$ plane — taller where $f$ is bigger. Or as a bowl-shaped landscape. Neural-network loss functions are very high-dimensional landscapes; the picture in your head stops working past three dimensions, but the math doesn't change.
The question is: what's the analog of a "derivative" when there are multiple input variables?
13 Partial derivatives
The answer: instead of one derivative, there are now several. One for each input variable. They're called partial derivatives, and the notation switches from $d/dx$ to $\partial/\partial x$ to signal that "$x$ is one of many" rather than "$x$ is the only variable."
Mechanically, computing a partial derivative is exactly like computing an ordinary derivative — you just pretend the other variables are constants. Watch:
Example: f(x, y) = x² + 3xy + y²
To compute $\partial f / \partial x$, treat $y$ as if it were the number 7 (or any other constant). Then:
- $x^2$ differentiates to $2x$ (power rule).
- $3xy$ — with $y$ a constant — differentiates to $3y$ (constant out front).
- $y^2$, a pure constant, differentiates to $0$.

Adding the pieces:
$$\frac{\partial f}{\partial x} = 2x + 3y$$
For $\partial f / \partial y$, switch roles: $x$ is now the constant.
- $x^2$ → $0$.
- $3xy$ → $3x$.
- $y^2$ → $2y$.

Adding the pieces:
$$\frac{\partial f}{\partial y} = 3x + 2y$$
$\partial f / \partial x = 2x + 3y$ means: if I'm at the point $(x, y) = (1, 4)$, and I wiggle $x$ a tiny bit, $f$ changes at a rate of $2(1) + 3(4) = 14$ per unit of wiggling. Wiggle $y$ instead and the rate is $3(1) + 2(4) = 11$. Different variables, different sensitivities.
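"Pretend the other variables are constants" translates directly into code: hold one argument fixed and wiggle the other. The sketch below estimates both partial derivatives of the example function at $(1, 4)$ with central differences; the helper names and the step size are assumptions for the example, and the estimates should land close to 14 and 11.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2


def partial_x(f, x, y, h=1e-6):
    """Wiggle x only; y is held fixed like a constant."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


def partial_y(f, x, y, h=1e-6):
    """Wiggle y only; x is held fixed like a constant."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)


print(partial_x(f, 1.0, 4.0))   # close to 2(1) + 3(4) = 14
print(partial_y(f, 1.0, 4.0))   # close to 3(1) + 2(4) = 11
```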
14 The gradient vector
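When a function has $n$ inputs $x_1, \ldots, x_n$, it has $n$ partial derivatives. We bundle them into a vector called the gradient:
$$\nabla f = \left( \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right)$$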
For our example $f(x, y) = x^2 + 3xy + y^2$, the gradient is:
$$\nabla f(x, y) = \big(2x + 3y,\; 3x + 2y\big)$$
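This is a vector that lives at each point of the input space. It has two related geometric meanings, both useful:
- Its direction is the direction in which $f$ increases fastest from that point (steepest ascent).
- Its length is how fast $f$ increases in that direction, that is, the slope of the steepest uphill path.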
The whole purpose of backpropagation is to compute the gradient of a loss function with respect to the weights of a neural network. Once you have the gradient, you take a step opposite to it ("downhill"), which lowers the loss a little. Repeat thousands of times. That's training.
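A single "downhill" step can be sketched with the example function from §13 standing in for a loss. The step size `learning_rate=0.01` and the function names are assumptions made for this illustration; real training code would get the gradient from backpropagation rather than from a hand-written formula.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2


def grad_f(x, y):
    """Gradient of f: the two partial derivatives, bundled together."""
    return (2 * x + 3 * y, 3 * x + 2 * y)


def step(point, learning_rate=0.01):
    """Move a small distance opposite to the gradient."""
    x, y = point
    gx, gy = grad_f(x, y)
    return (x - learning_rate * gx, y - learning_rate * gy)


start = (1.0, 4.0)
after = step(start)

print(grad_f(*start))            # (14.0, 11.0)
print(f(*start), f(*after))      # the second value is smaller than the first
```

The step is kept small because the gradient only describes the slope near the current point; a step that is too large can overshoot and raise the value instead.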
A gradient is just a vector of partial derivatives; there is no extra concept.
15 Common confusions
16 Where to go next
You now have most of the calculus you need for backpropagation. There is one missing ingredient: how to differentiate compositions of functions, like $\sigma(W \cdot x + b)$. That's the chain rule, and it has its own tutorial.
Once you have the chain rule, you'll be ready for the third and final piece — backpropagation itself, which is nothing more than the chain rule applied to the layers of a neural network.