Sunday, 9 July 2017

The RIGHT way to think about derivatives

At secondary school and then university, we meet a variety of definitions of the derivative, all motivated by the fundamental idea that the derivative is telling us something about the rate of change of some quantity as another varies. The definitions are likely to start with a fairly intuitive one and become more rigorous, incorporating the notion of limit. Unfortunately, as the rigour and degree of precision grow, it is possible for the student to become lost in a thicket of notation, and to lose sight of just what they are studying. However, I claim that if one thinks of the derivative the right way, then one has a fully rigorous concept of derivative which keeps fruitful contact with the intuitive one.

Let's take a look at (some of) these definitions, before settling on the right one.

A menu of definitions

The slope of the tangent to a graph

This is a fairly straightforward idea. You start off with the graph of a function, say (just to be conventional) \(y=f(x)\), and at some chosen value of \(x\) you draw the tangent to the curve. Then the gradient is how fast the height of the graph changes as \(x\) changes, so that sounds good.

At least, it sounds good until you start thinking about just what you might mean by 'tangent'. If the curve is a circle, that's pretty unambiguous: you take the line perpendicular to the radius at the point of interest, and we all know that is what is meant by the tangent.

Actually, since I am talking about the tangent to a graph, I've just used the upper half of the circle, given by \(y=\sqrt{1-x^2}\) for \(-1 \leq x \leq 1\).

But what if the graph isn't (part of) a circle?

We could try 'the tangent to \(y=f(x)\) at \((x,f(x))\) is the line that touches the graph at just the point \((x,f(x))\), without crossing it'. That works for circles, but now also includes some other well-known curves, including the parabola and hyperbola.


So we have generalized the notion of tangent usefully, to include some new curves. But what about this?

It looks like what we want, but cuts the graph in more than one place. But somehow that looks all right: the 'bad' intersection is a long way away from the point where we want the tangent. Clearly being a tangent is a local property, so we can fix this by saying 'the tangent to \(y=f(x)\) at \((x,f(x))\) is the line that touches the graph at just one point in the vicinity of \((x,f(x))\) without crossing it'.

But what about this?

The straight line certainly looks like a tangent, except that it crosses the graph at the point of tangency. There's no obvious way of retaining the notion of 'touching the graph without crossing it' here.

So eventually we accept that this is going to be hard to fix, and that something a little subtler is required.

The chord slope definition

Instead of thinking directly about the tangent to a point on the graph, we think of an approximation to it, and see how that can lead to a notion of tangent.

So although we want the tangent at \((x,f(x))\) we decide to look instead at the straight line joining two points on the graph: \((x,f(x))\) and \((x+h,f(x+h))\), where we insist that \(h \neq 0\).

The gradient is \[ \frac{f(x+h)-f(x)}{(x+h)-x} = \frac{f(x+h)-f(x)}{h} \] and it seems entirely reasonable (because it is) to say that the slope of the tangent is the limiting value of this as \(h\) gets closer and closer to \(0\). So the tangent to \(y=f(x)\) at \((x,f(x))\) is the line through \((x,f(x))\) with this gradient. At first exposure, we might not go into too many details of just what we mean by the limiting value, sticking with examples where no obvious complications arise.
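To see the chord slope settling down as \(h\) shrinks, here is a minimal Python sketch. The function \(f(x)=x^2\), the point \(x=3\) and the helper name `chord_slope` are my own choices for illustration, not anything forced by the definition.

```python
def chord_slope(f, x, h):
    """Gradient of the chord joining (x, f(x)) and (x+h, f(x+h)); h must be nonzero."""
    if h == 0:
        raise ValueError("h must be nonzero")
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# For f(x) = x^2 the chord slope at x works out algebraically to 2x + h,
# so the printed values should drift towards 6 as h shrinks.
for h in (1.0, 0.1, 0.01, 0.001):
    print(h, chord_slope(square, 3.0, h))
```

Nothing here takes a limit: it only evaluates the chord slope at a few values of \(h\), which is exactly the informal 'closer and closer' idea.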

This is a meaningful definition. Using it we can calculate the derivative of a variety of simple functions, such as powers of \(x\) and, with the help of a little geometric ingenuity, the trig functions \(\sin\) and \(\cos\).

It's also a sensible definition, because it really does seem to do the job we want to do. It doesn't make sense to say it's 'correct', because this defines the notion. But it does have the properties we'd hope for, and it's something we can calculate with. It's worth noticing that in the process we have shifted from defining the derivative in terms of the tangent to defining the tangent in terms of the derivative!

I claim that even so, it isn't the right way to think about the derivative.

The formal limit definition

After spending some time with the chord slope definition, and informally working out some simple limits, it's usually time to give a more rigorous idea of what is going on here.

We then introduce the following incantation: \[ \lim_{x \to a}f(x) = L \] means \[ \forall \epsilon>0, \exists \delta>0 \text{ such that } 0 \lt |x-a| \lt \delta \Rightarrow |f(x)-L| \lt \epsilon \] At first sight, and especially for the novice, this is likely to cause a great deal of fear and trembling. It is a complex statement which makes very precise a simple idea. The idea is that we can make \(f(x)\) as close as we want to \(L\) by making \(x\) sufficiently close (but not equal) to \(a\).
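The quantifiers can be made a little less frightening by playing with them numerically. The sketch below is only an illustration under my own assumptions: the function \(3x+1\), the point \(a=2\), the limit \(L=7\) and the helper `delta_works` are all invented for the purpose. It tests a proposed \(\delta\) on a finite sample of points, which can show that a choice of \(\delta\) fails but can never prove that one succeeds.

```python
def delta_works(f, a, L, eps, delta, samples=1000):
    """Sample check that 0 < |x - a| < delta implies |f(x) - L| < eps.

    Returns False as soon as a sampled x violates the implication.
    A True result is evidence only, not a proof.
    """
    for k in range(1, samples + 1):
        offset = delta * k / (samples + 1)  # strictly between 0 and delta
        for x in (a - offset, a + offset):
            if not abs(f(x) - L) < eps:
                return False
    return True


def line(x):
    return 3 * x + 1


eps = 0.01
print(delta_works(line, 2.0, 7.0, eps, eps / 3))  # delta = eps/3 passes the sample check
print(delta_works(line, 2.0, 7.0, eps, eps))      # delta = eps is too generous
```

For this particular function \(|f(x)-7| = 3|x-2|\), which is why \(\delta = \epsilon/3\) is the natural choice.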

With this in hand, we can tighten up some of what is done with the chord slope idea by writing \[ f'(x)= \lim_{h \to 0} \frac{f(x+h)-f(x)}{h} \] It isn't really anything new, it's just a more precise way of stating the previous idea.

There's also the minor variation that \(f\) has derivative \(f'(x)\) at \(x\) if \[ \lim_{h \to 0} \left| \frac{f(x+h)-f(x)}{h} - f'(x) \right| = 0 \]

Working with these, including showing that they are equivalent, gives some practice in understanding the formal statement of what is meant by a limit, and how that formal definition is used. But I also claim that this is not the right way to think about the derivative.

Best Linear Approximation

Once the derivative has been defined and calculated in some tractable cases, various theorems about derivatives are presented. Here's one that gets used a lot. \[ f(x+h) \approx f(x)+hf'(x) \] or rather, with some foresight, \[ f(x+h)-f(x) \approx f'(x)h \] Now, as it stands, this says less than you might think. As long as the function \(f\) is continuous at \(x\), then for very small \(h\), \(f(x)\) and \(f(x+h)\) are very close together, and \(f'(x)h\) is also very small, so the approximation gets better and better as \(h\) approaches \(0\), no matter what value we use for \(f'(x)\). But this is just saying that both sides are approaching \(0\), and this obviously isn't all we mean by that approximate formula.

Let's tease it out by looking a little more carefully. Instead of having an approximate equality, we will have an exact equality which includes an error term. \[ f(x+h) - f(x) = f'(x)h+e(h) \] where \(e(h)\) is the error, which in general will depend on the size of \(h\). Now, it turns out that if \(f\) is differentiable at \(x\), with derivative \(f'(x)\), then as \(h\) gets closer and closer to \(0\), not only does \(e(h)\) get arbitrarily close to \(0\), but \(e(h)/h\) also gets arbitrarily small. When this happens I say that \(e(h)\) is suitably small.

In other words, for very small \(h\), the error is a very small fraction of \(h\), and we can make that fractional error as small as we want by choosing \(h\) small enough.
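Here is a small numerical sketch of what 'suitably small' looks like in practice. I have picked \(f=\sin\) at \(x=1\), where the derivative is known to be \(\cos 1\); the helper name `error_term` is mine.

```python
import math


def error_term(f, dfx, x, h):
    """e(h) = f(x+h) - f(x) - f'(x) h, with the derivative value dfx supplied by hand."""
    return f(x + h) - f(x) - dfx * h


x = 1.0
dfx = math.cos(x)  # the known derivative of sin at x
for h in (0.1, 0.01, 0.001):
    e = error_term(math.sin, dfx, x, h)
    # Both columns shrink: e(h) itself, and the fractional error e(h)/h.
    print(h, e, e / h)
```

Each time \(h\) shrinks by a factor of ten, the fractional error \(e(h)/h\) shrinks by roughly the same factor, so \(e(h)\) itself is shrinking much faster than \(h\).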

It's this behaviour of the error term that makes it all work. If we choose any other value, say \(K\) instead of \(f'(x)\) and try to use that instead, we have \[ f(x+h) - f(x) = Kh + E(h) \] where \(E(h)=(f'(x)-K)h +e(h)\). But for very small \(h\), we see that \[ \frac{E(h)}{h}=f'(x)-K + \frac{e(h)}{h} \] and we cannot make this small by making \(h\) small. So the special property of \(f'(x)\) is that \(f'(x)h\) is the best linear approximation to \(f(x+h)-f(x)\).
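The contrast between the right slope and any other can be seen numerically. This is again only a sketch with \(f=\sin\) at \(x=1\); the wrong guesses \(K=0.5\) and \(K=0\) are arbitrary, and `error_over_h` is a name I have made up.

```python
import math


def error_over_h(f, slope, x, h):
    """E(h)/h for the linear approximation slope*h to f(x+h) - f(x)."""
    return (f(x + h) - f(x) - slope * h) / h


x = 1.0
true_slope = math.cos(x)          # derivative of sin at x
for K in (true_slope, 0.5, 0.0):  # 0.5 and 0.0 are deliberately wrong guesses
    ratios = [error_over_h(math.sin, K, x, h) for h in (1e-1, 1e-2, 1e-3)]
    print(K, ratios)
```

Only the first row of ratios heads towards \(0\). The others settle near \(\cos 1 - K\), which is the \(f'(x)-K\) term that no amount of shrinking \(h\) can remove.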

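To make contact with the usual limit definition, divide the exact equality by \(h\) to get \[ \frac{f(x+h)-f(x)}{h} - f'(x) = \frac{e(h)}{h} \] so the chord slope tends to \(f'(x)\) if and only if \(e(h)/h\) tends to \(0\). In other words, there is a best linear approximation if and only if the error term is suitably small.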

I claim that this is the best way to think about the derivative.

What's so great about it?

The first thing to say is, it doesn't make it any easier to calculate a derivative. That's still just the same as it always was.

What's so great about it is that it gives insight into how derivatives behave. This is not to say that it makes the proofs of behaviour easier: but it does make the results easier to see and understand.

Differentiation rules

We learn the various rules for differentiating combinations of functions in our first calculus course: linearity, the product, quotient and chain rules. Let's see how this point of view does with them.

In each case, we'll think about two functions \(f\) and \(g\), and use \(e_f\) and \(e_g\) for the corresponding errors.


Linearity

Suppose we have two real numbers \(\alpha\) and \(\beta\). Then \[ \begin{split} (\alpha f+\beta g)(x+h)-(\alpha f+\beta g)(x) &= \alpha f(x+h)+ \beta g(x+h) - \alpha f(x)-\beta g(x)\\ &=\alpha(f(x+h)-f(x))+\beta(g(x+h)-g(x))\\ &= \alpha (f'(x)h+e_f(h))+\beta(g'(x)h+e_g(h))\\ &=(\alpha f'(x)+\beta g'(x))h +\alpha e_f(h)+\beta e_g(h)\\ &=(\alpha f'(x)+\beta g'(x))h + e(h) \end{split} \] where \(e(h)=\alpha e_f(h)+\beta e_g(h)\) is obviously suitably small.

So differentiation is linear.

Product rule

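\[ \begin{split} (fg)(x+h)-(fg)(x) &= f(x+h)g(x+h)-f(x)g(x)\\ &= (f(x)+f'(x)h+e_f(h))(g(x)+g'(x)h+e_g(h))-f(x)g(x)\\ &= (f'(x)g(x)+f(x)g'(x))h +f'(x)g'(x)h^2 + (f(x)+f'(x)h)e_g(h)+(g(x)+g'(x)h)e_f(h) + e_f(h)e_g(h)\\ &= (f'(x)g(x)+f(x)g'(x))h + e(h) \end{split} \] where \(e(h)\), which collects every term after the first, is still fairly obviously suitably small: each of those terms, divided by \(h\), tends to \(0\) with \(h\).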

Quotient rule

Actually, I'll cheat slightly. Assuming that \(f(x) \neq 0\), we have \[ \begin{split} \frac{1}{f(x+h)} - \frac{1}{f(x)} &= \frac{1}{f(x)+f'(x)h+e_f(h)}-\frac{1}{f(x)}\\ &=\frac{-f'(x)h-e_f(h)}{f(x)(f(x)+f'(x)h+e_f(h))}\\ &= -\frac{f'(x)h}{f(x)(f(x)+f'(x)h+e_f(h))} - \frac{e_f(h)}{f(x)(f(x)+f'(x)h+e_f(h))} \end{split} \] which is less obviously, but quite plausibly \[ -\frac{f'(x)}{f(x)^2}h + e(h) \] where \(e(h)\) is suitably small.

Then the quotient rule is just the product rule combined with this one.

Chain rule

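And finally, writing \(e_f\) for the error term of \(f\) at the point \(g(x)\), we have \[ \begin{split} f(g(x+h))-f(g(x)) &= f(g(x)+g'(x)h+e_g(h))-f(g(x))\\ &= f(g(x))+f'(g(x))(g'(x)h+e_g(h)) + e_f(g'(x)h+e_g(h)) -f(g(x))\\ &= f'(g(x))g'(x)h+e(h) \end{split} \] where \(e(h) = f'(g(x))e_g(h) + e_f(g'(x)h+e_g(h))\), and again it is not immediately obvious, but is quite plausible, that \(e(h)\) is suitably small.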

Insight, not proof

So we can see that in each case, assuming that the behaviour of error terms is not unreasonable, this idea of the derivative as the best linear approximation to the change in the function value gives us insight into why the various rules for differentiation work as they do.
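Since these are plausibility arguments rather than proofs, it is reassuring to check numerically that the error terms really do behave. The sketch below uses my own illustrative choices (\(f=\sin\), \(g=\exp\), \(x=0.5\), and the constants \(\alpha=2\), \(\beta=-3\)) and a made-up helper `error_ratio`; for each rule it prints \(e(h)/h\) for shrinking \(h\), using the slope that the rule predicts.

```python
import math


def error_ratio(F, slope, x, h):
    """e(h)/h, where e(h) = F(x+h) - F(x) - slope*h."""
    return (F(x + h) - F(x) - slope * h) / h


# Illustrative choices, not taken from the text: f = sin, g = exp, at x = 0.5.
f, df = math.sin, math.cos
g, dg = math.exp, math.exp
x = 0.5
alpha, beta = 2.0, -3.0

# Each entry: (rule name, combined function, slope predicted by the rule).
cases = [
    ("linearity", lambda t: alpha * f(t) + beta * g(t), alpha * df(x) + beta * dg(x)),
    ("product", lambda t: f(t) * g(t), df(x) * g(x) + f(x) * dg(x)),
    ("reciprocal", lambda t: 1.0 / f(t), -df(x) / f(x) ** 2),  # needs f(x) != 0
    ("chain", lambda t: f(g(t)), df(g(x)) * dg(x)),
]

for name, F, slope in cases:
    ratios = [error_ratio(F, slope, x, h) for h in (1e-1, 1e-2, 1e-3)]
    print(name, ratios)
```

In every row the ratio shrinks roughly in proportion to \(h\), which is what 'suitably small' demands. This is evidence for these particular functions only, not a substitute for the proofs.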

But there's more

This works very well. The mental picture of a best linear approximation is much more meaningful than that of some limiting value, and the arguments involving it are rather more intuitive than those involving the usual limit definition.

But, as is often the way with a good idea, we get more than we seem to have paid for.

Functions of several variables

Suppose that now instead of having a function \(f:\mathbb{R} \to \mathbb{R}\) we have \(f:\mathbb{R}^n \to \mathbb{R}\). The picture is much the same except that \(x\) and \(h\) are now vectors with \(n\) components, and I will adopt the standard convention that they are represented as column vectors. So we can still say that \(f\) is differentiable at \(x\) with derivative \(f'(x)\) if this \(f'(x)\) has the property that \[ f(x+h)-f(x)=f'(x)h +e(h) \] where \(e(h)\) is suitably small. Since we can no longer divide by the vector \(h\), 'suitably small' now means that \(e(h)/\|h\|\) tends to \(0\) as \(\|h\|\) tends to \(0\).

But now we can think about what sort of thing this \(f'(x)\) actually is. \(h\) is a vector with \(n\) components, and so \(f'(x)\) must be a linear mapping from the space of these vectors to \(\mathbb{R}\), in other words it must be a row vector, or covector, with \(n\) components.

This is where we first see why I wrote \(f'(x)h\) rather than \(hf'(x)\) previously. When \(f\) was a real valued function of a real variable, it made no difference, but now the order is significant, and is determined by the requirement that we end up with a scalar quantity.

The next obvious question is, what are these components of \(f'(x)\)? We can answer this by choosing \(h\) carefully. To pick out components of vectors, I'll use a superscript for a component of a column vector and a subscript for a component of a row vector. Then we have \[ f(x+h)-f(x) = f'(x)h +e(h) = f'(x)_i h^i +e(h) \] with summation over the repeated index \(i\) understood. Choosing \(h\) to have only its \(i\)th component non-zero reduces this to the one-variable situation in the \(i\)th variable, which tells us that \[ f'(x)_i = \frac{\partial f}{\partial x^i}(x) \]

So our definition of 'derivative' naturally produces the object we learn to call the gradient of \(f\) in vector calculus (and maybe gives a little more explanation for the name).
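As a concrete check, here is a sketch in plain Python with an example of my own: \(f(x^1,x^2,x^3) = (x^1)^2x^2+\sin x^3\), whose partial derivatives are written out by hand in `grad_f`. The helper names are invented for this illustration. The row vector acts on the column vector \(h\) by summing products of components, and the error is measured relative to the length of \(h\).

```python
import math


def f(x):
    """Example scalar field on R^3 (an illustrative choice)."""
    return x[0] ** 2 * x[1] + math.sin(x[2])


def grad_f(x):
    """Components of the row vector f'(x): the partial derivatives of f, computed by hand."""
    return [2 * x[0] * x[1], x[0] ** 2, math.cos(x[2])]


def apply_row(row, h):
    """Row vector acting on a column vector: the sum over i of row_i * h^i."""
    return sum(r * c for r, c in zip(row, h))


def relative_error(x, h):
    """e(h) / |h| where e(h) = f(x+h) - f(x) - f'(x) h."""
    x_plus_h = [a + b for a, b in zip(x, h)]
    e = f(x_plus_h) - f(x) - apply_row(grad_f(x), h)
    return e / math.sqrt(sum(c * c for c in h))


x = [1.0, 2.0, 0.5]
direction = [0.3, -0.4, 0.2]  # an arbitrary direction for h
for t in (1e-1, 1e-2, 1e-3):
    h = [t * d for d in direction]
    print(t, relative_error(x, h))
```

Taking `direction` to be a coordinate direction such as `[1.0, 0.0, 0.0]` picks out a single component of `grad_f(x)`, which is the 'choose \(h\) carefully' step in numerical form.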

In fact, it tells us more. It tells us that partial derivatives are really components of the best linear approximation, i.e. the actual derivative of the function. The partial derivatives are not best thought of as objects which we define as ordinary derivatives 'holding all but one variable constant' but as components of a (row) vector naturally associated with the function.

Again, thinking of the derivative as the best linear approximation gives us a deeper insight into a chunk of familiar calculus. As a bonus, it also gives us our first glimpse of the fact that the derivative of a function of this type is not a vector, but is a covector. This is a distinction that grows in importance as one generalizes from functions on \(\mathbb{R}^n\) to functions on a surface in \(\mathbb{R}^n\), and finally to functions on a manifold which may or may not be equipped with a (semi-)Riemannian metric. It is a gift which keeps on giving.

Tangent (velocity) vector

In the last example, we looked at functions which associate a value to each point of space. This time, we'll do the opposite, and think about functions which associate a point in space to each value of some parameter which we can think of as time. So we have functions of the form \(f:\mathbb{R} \to \mathbb{R}^n\), and use \(t\) for the argument of the function.

We can now play the same game and see where it takes us. \(f\) is differentiable with derivative \(f'(t)\) if \[ f(t+h)-f(t) = f'(t) h + e(h) \] where \(e(h)\) is suitably small.

This time, we see that \(h\) is just a number, and so \(f'(t)\) really is a vector with \(n\) components. And by looking at the components of \(f(t)\), we see that \[ f'(t)^i = \frac{d f^i}{dt}(t) \] and so the derivative of \(f\) really is the velocity of a point whose position is given as a function of \(t\) by \(f(t)\).
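A quick numerical sketch, using a point moving round the unit circle as my own example (the function names are invented): the candidate derivative is the componentwise derivative, and the length of the vector error, divided by \(|h|\), should shrink with \(h\).

```python
import math


def position(t):
    """Point moving round the unit circle (an illustrative curve in R^2)."""
    return [math.cos(t), math.sin(t)]


def velocity(t):
    """Componentwise derivative of position: the candidate f'(t)."""
    return [-math.sin(t), math.cos(t)]


def error_size_over_h(t, h):
    """|e(h)| / |h| where e(h) = f(t+h) - f(t) - f'(t) h is a vector in R^2."""
    p0, p1, v = position(t), position(t + h), velocity(t)
    e = [b - a - vi * h for a, b, vi in zip(p0, p1, v)]
    return math.hypot(e[0], e[1]) / abs(h)


for h in (1e-1, 1e-2, 1e-3):
    print(h, error_size_over_h(1.0, h))
```

Here \(h\) is a number multiplying a vector, so \(f'(t)h\) is just the velocity scaled by the elapsed time.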

It's useful to note that the character of \(f'\), i.e. whether it is a scalar quantity, or a vector, or a covector, is entirely determined by the dimensions of the domain and codomain of \(f\). We don't choose whether to make a vector or a covector out of the derivatives of \(f\); that is decided by the fact that \(f'\) is the best linear approximation.

The whole hog

Let's combine the two ideas up above, and consider \(f:\mathbb{R}^n \to \mathbb{R}^m\), so now \(f(x)\) is a vector with \(m\) components and \(x\) is a vector with \(n\) components. As always, \(f\) is differentiable with derivative \(f'(x)\) if \[ f(x+h)-f(x) = f'(x)h + e(h) \] where \(e(h)\) is suitably small.

Now, \(f'(x)\) is a linear transformation which takes a vector with \(n\) components and produces a vector with \(m\) components. In other words, it is an \(m \times n\) matrix. Using my previous notational device of labelling rows with superscripts and columns with subscripts, we now have \[ f'(x)^i_j = \frac{\partial f^i}{\partial x^j} \] and the hidden agenda behind the placement of the indices now becomes slightly exposed. (But trust me, there is plenty still lurking behind the tapestry.)
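To make the shape of \(f'(x)\) concrete, here is a sketch with a map \(\mathbb{R}^2 \to \mathbb{R}^3\) that I have invented for the purpose, \(f(x^1,x^2) = (x^1x^2,\ x^1+x^2,\ (x^1)^2)\), with its matrix of partial derivatives written out by hand. The helper names are mine.

```python
def f(x):
    """Illustrative map from R^2 to R^3."""
    return [x[0] * x[1], x[0] + x[1], x[0] ** 2]


def jacobian(x):
    """3 x 2 matrix: row i holds the partial derivatives of the component f^i."""
    return [[x[1], x[0]],
            [1.0, 1.0],
            [2 * x[0], 0.0]]


def mat_vec(A, h):
    """Matrix acting on a column vector."""
    return [sum(a * c for a, c in zip(row, h)) for row in A]


def error(x, h):
    """e(h) = f(x+h) - f(x) - f'(x) h, a vector with 3 components."""
    x_plus_h = [a + b for a, b in zip(x, h)]
    return [p - q - r for p, q, r in zip(f(x_plus_h), f(x), mat_vec(jacobian(x), h))]


x = [2.0, 3.0]
J = jacobian(x)
print(len(J), "rows,", len(J[0]), "columns")  # m = 3 rows, n = 2 columns
print(error(x, [0.01, -0.02]))                # every component is second order in h
```

For this particular \(f\) the error can be found exactly by hand: it is \((h^1h^2,\ 0,\ (h^1)^2)\), which is visibly of second order in \(h\).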

So the power of thinking of the derivative as the best linear approximation should now be rather clearer: we can't always use the chord-slope definition, though we can generally pick an aspect of the set-up where it can be used to define, say, partial derivatives. But the best linear approximation makes sense in great generality, and the familiar objects built out of partial or component derivatives can now be seen as the natural objects, and the partial or component derivatives as analogous to the components of vectors.


In general, we can't multiply, divide, or compose functions. But whenever we can, the rules for differentiating combinations work in just the same way. In fact, if you revisit the (plausibility) argument for each rule, you can see that all that matters is that the algebraic operation makes sense. Although they were written with functions of the form \(\mathbb{R} \to \mathbb{R}\) in mind, in fact that is irrelevant.

Probably the most powerful aspect of this comes from the chain rule. It tells us that the derivative of the composition is the product (in general, a matrix product) of the derivatives of the functions. A beautiful consequence is that if \(f,g:\mathbb{R}^n \to \mathbb{R}^n\) are inverses of each other, so are the derivatives: \(g'(f(x))\) is the inverse matrix of \(f'(x)\). So the partial derivatives of \(f\) are not the inverses of the partial derivatives of \(g\) (a standard error of students in advanced calculus courses), but the matrices built out of the partial derivatives are in inverse relation to one another.
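Polar coordinates make a good test case. In the sketch below (my own example, with hand-computed matrices of partial derivatives and invented function names), \(f\) takes \((r,\theta)\) to \((x,y)\) and \(g\) goes back again, for \(r>0\) and \(-\pi<\theta<\pi\).

```python
import math


def to_cartesian(p):
    """f(r, theta) = (r cos theta, r sin theta)."""
    r, th = p
    return [r * math.cos(th), r * math.sin(th)]


def jac_f(p):
    """Matrix of partial derivatives of (x, y) with respect to (r, theta)."""
    r, th = p
    return [[math.cos(th), -r * math.sin(th)],
            [math.sin(th), r * math.cos(th)]]


def jac_g(c):
    """Matrix of partial derivatives of (r, theta) with respect to (x, y)."""
    x, y = c
    r2 = x * x + y * y
    r = math.sqrt(r2)
    return [[x / r, y / r],
            [-y / r2, x / r2]]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


p = [2.0, 0.7]
c = to_cartesian(p)
product = mat_mul(jac_g(c), jac_f(p))  # chain rule: (g o f)' = g'(f(p)) f'(p)
print(product)                         # close to the identity matrix

# The individual partial derivatives are NOT reciprocals of one another:
dx_dr = jac_f(p)[0][0]  # cos(theta)
dr_dx = jac_g(c)[0][0]  # x / r, which is also cos(theta), not 1 / cos(theta)
print(dx_dr, dr_dx)
```

The matrices multiply to the identity, yet \(\partial x/\partial r\) and \(\partial r/\partial x\) come out equal rather than reciprocal, which is the student error in miniature.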

Algebra for and from calculus

So we are just beginning to see that what we know about linear algebra (vector spaces, dual vector spaces, linear transformations) is relevant to calculus. Since the derivative is the best linear approximation, and derivatives of combinations of functions are linear algebraic combinations of the derivatives, we can use the apparatus of linear algebra to tell us about the (local) behaviour of non-linear functions.

In some ways, this is the entire point of differential calculus: it is a tool for turning the difficult analysis of nonlinear functions into the relatively easy analysis of linear ones. The identification of the derivative as the best linear approximation is what makes this all work.

This all extends to even more general situations. The spaces involved don't actually have to be finite dimensional, as long as there is a way of talking about the size of the vectors.

It leads on to, for example, ways of considering fluid flow, where the velocity field of the fluid can be thought of as the tangent velocity to a curve in an infinite dimensional space of fluid configurations, and of thinking about the Euler-Lagrange equations in the calculus of variations as the derivative of a functional, to name just two.

Simply the best?

This gives some of the reasons I think that the right way to think of the derivative is as the best linear approximation. It both gives insight into what the rules for derivatives of combinations of functions are actually saying, and makes the various generalizations easier to understand. It lets us see how we can use linear algebra to understand aspects of the behaviour of nonlinear functions. It generalizes still further, in ways that I have just hinted at.

But it isn't the right way for everybody. I don't think it's the right way for it to be introduced to novices, or indeed for novices to try to think about it. It makes most sense when you already have a bunch of different-looking bits of differentiation at hand, which seem to be mysteriously linked together; then it provides a unifying framework to remove the mystery. It lets things drop into place together, and you can see how they're all really different aspects of the same thing. But you do have to go through a certain amount of getting to grips with the more traditional approaches in order to have a mental place for this to live.


  1. Thank you so much for this article! It took an hour to diligently get through and write some derivations myself, but it definitely changed my understanding of derivatives and how it is connected to linear algebra.

  2. I'm really pleased you got so much from it!