## Saturday, February 3, 2018

### What's a number? 2 - really

After mastering (or at least being exposed to) the arithmetic of natural numbers, we go beyond, to rational and even irrational numbers. But again, we usually do this without any real discussion of what they are. We just absorb the notion that a fraction or an infinite decimal expansion mean something, or that asking for a quantity whose square is $$2$$ is a sane thing to do.
The journey from the natural numbers to the reals is full of interesting scenery, and the destination has a depth and richness that you can spend lifetimes investigating without ever doing more than scratch the surface. I'm only going to sketch the process here, pointing out some of the more interesting sights on the way, but missing the calculations that justify it. If you haven't seen this before, I hope you'll get at least a rough idea of what's going on. If you have, maybe you'll get a new insight or perspective on it.

# How should they behave?

Just as with the natural numbers, we look at the behaviour of the numbers that we have some intuition for, and demand that whatever the real numbers are, they should obey the same rules. In particular, we want to be able to add, subtract, multiply and divide with all the usual associative, commutative and distributive properties. We also want to be able to compare numbers, so we want to be sure that given any $$a$$ and $$b$$ exactly one of $$a \lt b$$, $$a=b$$ or $$a \gt b$$ is true, and inequalities have to behave nicely, in the sense that we can add anything to both sides of an inequality, or multiply both sides by any positive number.

# Something to take away.

Equipped with only the natural numbers $$\mathbb{N}$$ (realized through the von Neumann numbers) we can always multiply and add, but sometimes we can subtract, and sometimes we can't, depending on the numbers.
We say that if $$m,n \in \mathbb{N}$$ then $$m \geq n$$ if there is some $$x\ \in \mathbb{N}$$ such that $$m=x+n$$, and $$m \gt n$$ if $$x \neq 0$$. Then as long as $$m \ge n$$ we can define $$m-n$$ to be this $$x$$.
The first step is to fit negative numbers into the picture so that we have a meaning for $$m-n$$ when this is not the case.
A neat trick for this is to think about pairs of number $$(m,n)$$, and then think of two pairs of numbers $$(m,n)$$ and $$(p,q)$$ as equivalent when $m+q=p+n$ We see straight away that if $$m \geq n$$ and $$p \geq q$$ that this equivalence is saying that $$m-n = p-q$$. We can think of any particular pair $$(m,n)$$ as representing the difference $$m-n$$, and then we make the jump to say that this is also true if $$m \lt n$$.
Now we lump the pairs together into groups of equivalent pairs, and call each of these an equivalence class. The set of all the equivalence classes is called $$\mathbb{Z}$$, the set of integers.
This is a very powerful idea, so let's take a moment to look at it more carefully.
I'm saying that I think of the pair $$(3,0)$$ as representing the same object as $$(4,1)$$, $$(5,2)$$, $$(6,3)$$, or in fact any pair of the form $$(n+3,n)$$ for any natural number $$n$$. They are all pairs of numbers where the first is $$3$$ more than the second, and they all represent the number $$3$$. Likewise, the pair $$(0,3)$$ represents the same object as $$(1,4)$$, $$(2,5)$$, $$(3,6)$$ or any pair of the form $$(n,n+3)$$ where $$n$$ is a natural number. They are all pairs where the second number in the pair is $$3$$ more than the first; they all represent the thing I want to call $$-3$$. The only thing I care about it the difference between the first number in the pair and the second.
So we need to make sure that we can add and multiply these integer objects sensibly: that means we need to know how to add and multiply representatives in such a way that whichever representatives we choose to combine, the result is always a representative of the same collection.
This is a lot like what happens if we lump numbers together into odd and even. It doesn't matter which two odd numbers I add together, I always get an even number, so I can just say that odd plus odd equals even. Likewise, I can say what happens if I add or multiply any combination of odd and even; it doesn't matter just which numbers I pick, only whether they are odd or even.
I need to do the same thing with these pairs: it mustn't matter which representative pairs I pick to operate in, I must get a representative of the same resulting class.
Here's a way that works (using juxtaposition to denote multiplication):
• $$(m,n)+(p,q)=(m+p,n+q)$$
• $$(m,n)(p,q) = (mp+nq,mq+np)$$
If we do this, we can see that each element $$n$$ of $$\mathbb{N}$$ is represented by the pair $$(n,0)$$, and we've acquired a collection of negative numbers represented by pairs of the form $$(0,n)$$.
Why is this? Well, notice that $$(n,0)+(0,n)=(n,n)$$, which is equivalent to $$(0,0)$$, which is the representative of $$0$$.
In particular we have a way of subtracting $$(n,0)$$ from $$(m,0)$$ when $$n \gt m$$. We just see how to subtract $$(n,0)$$ from $$(m,0)$$ when $$n \lt m$$, and insist that it still works. If $$n \lt m$$ we just have $$(m-n,0)$$, which is equivalent to $$(m,n)$$ and then we use the same thing when $$n \gt m$$. But we know that in this case $$(m,n)$$ is equivalent to $$(0,n-m)$$, which gives a meaning to the standard statement that if $$n \gt m$$ then $$m-n = -(n-m)$$.
We also keep the notion of order. Any pair is (equivalent to one) of the form $$(n,0)$$ for $$n \in \mathbb{N}$$ or $$(0,n)$$ where $$n \in \mathbb{N}, n \neq 0$$. We say that an integer is positive if it has a representative of the first form, and negative if it has one of the second form. Then all the usual rules about multiplying signs follow automatically.
I won't go through the check that all this works. If you want, you can see some more detail here.
Now that we have the set of integers, $$\mathbb{Z}$$, I'll just use single letters $$m,n$$ etc to represent individual integers: but bear in mind that each of these letters represents an equivalence class of pairs of natural numbers.

# Be rational

Now we can add, subtract and multiply. But mostly we can't divide. We say that $$m / n = q$$ if $$nq=m$$, but there usually isn't such a $$q$$: for example, there is nothing in $$\mathbb{N}$$ we can multiply by $$2$$ to get the answer $$5$$.
But we can use the previous trick again. The last time, we considered pairs to be equivalent if they had the same difference. We can do just the same thing but now thinking of the pairs as ratios instead of differences. So we think about pairs of integers $$(m,n)$$ where $$m,n \in \mathbb{N}$$ and $$n \neq 0$$. We now think of $$(m,n$$ and $$(p,q)$$ as equivalent if $mq=np.$
so for example, $$(1,2)$$, $$(2,4)$$, $$(3,6)$$ and so on are all equivalent. Of course, we all already know that $$\frac{1}{2}$$, $$\frac{2}{4}$$, and $$\frac{3}{6}$$ are different ways of writing the same fraction; this is all finding a way to say that when we don't have any fractions yet, only integers.
We call the set of equivalence classes of these pairs $$\mathbb{Q}$$, the set of rational numbers.
Something happens here that I find quite striking. In the previous section, we introduced solutions to an 'addition' problem, and got something where the addition rule was simple, but the multiplication one wasn't. This time, we introduce solutions to a 'multiplication' problem, and get a simple multiplication rule, but a complicated addition one.
We have
• $$(m,n)(p,q)=(mp,nq)$$
• $$(m,n)+(p,q) = (mq+np,nq)$$
It takes a while, but you can check that all the usual arithmetic rules work when you define multiplication this way.
So now if we have two rational numbers $$(m,n)$$ and $$(p,q)$$ we can see that the answer to the question 'what do I multiply $$(p,q)$$ by to get $$(m,n)$$?' is $$(mq,np)$$ as long as $$p \neq 0$$.
What's more, every rational can be represented by a pair of the form $$(m,n)$$ where $$n \gt 0$$. Then we say that the rational number has the same sign (positive, zero, or negative) as $$m$$ does. And then inequalities behave themselves: you can add the same thing to both sides or multiply both sides by a positive quantity.
So now we have a set of objects (rather complicated objects, to be sure) which contain the natural numbers we started. With these numbers, we can add, subtract, multiply and divide (as long as we don't try to divide by $$0$$).
Let's stop to look around.
By means of two algebraic steps - really the same trick applied twice, once for addition and once for multiplication - we have gone from the natural numbers $$\mathbb{N}$$ to the rational numbers $$\mathbb{Q}$$.
In the rational numbers, we have a well-behaved addition and multiplication. Both are associative and commutative and the multiplication distributes over addition. There is an additive identity, $$0$$, and a multiplicative identity, $$1$$. Every rational number has an additive inverse, and any non-zero rational number has a multiplicative inverse.
So algebraically, the rational numbers are as well-behaved as you could hope for.
What's more, they are nicely ordered: we know what it means for a rational number to be positive or negative, and inequalities behave themselves, by which I mean that they are preserved by addition of any rational number or multiplication by any positive one.
And yet they are unsatisfactory. If we draw an isosceles right triangle with two sides of length $$1$$, then the length of the hypotenuse should give $$2$$ when squared. Unfortunately, there is no rational number with this property. There are rational numbers whose square is less than $$2$$, and those whose square is greater than $$2$$, and we can get the square as close as we want to to $$2$$, but we can't get it exactly.
And this isn't the only missing value. So what can we do about it?

# Mind the gap

This step is rather different from the others. The others filled in algebraic holes - a lack of additive or multiplicative inverses. Now we have to fill in a different kind of hole - one where we seem to be able to get as close as we want to to some value, but the value itself is missing.
First, we recall what it means for a sequence of numbers, $$q_n$$, to converge to a value, $$q$$. This is captured by saying that given any tolerance, the sequence eventually strays no farther than that from the limit: in the standard mathematical symbols, $\forall \epsilon \gt 0, \exists N \in \mathbb{N} | n \gt N \Rightarrow |q_n-q| \lt \epsilon.$ But this isn't as helpful as we might hope, because it's the sequences that look as if they converge but don't which are the problem. Fortunately, Cauchy came up with a clever way of characterizing a sequence which looks as if it ought to converge. This is that given any tolerance, pairs of terms in the sequence are eventually within that tolerance of each other. This time we have $\forall \epsilon \gt 0, \exists N \in \mathbb{N} | m,n \gt N \Rightarrow |q_n-q_m| \lt \epsilon$ and a sequence which satisfies this criterion is called a Cauchy sequence.
And now, just as we constructed objects to solve equations of the form $$x+n=m$$ and $$xn=m,$$ we need to construct objects to act as the limit of a sequence which looks convergent.
Again, we do it by defining some equivalence classes, which will fill in the gaps in $$\mathbb{Q}$$ and be the real numbers. To make these equivalence classes, we say that two Cauchy sequences $$q_n$$ and $$r_n$$ are equivalent if $$q_n - r_n \to 0$$. Then the set of equivalence classes is $$\mathbb{R}$$, the set of real numbers. This time the element $$q$$ of $$\mathbb{Q}$$ is represented in $$\mathbb{R}$$ by the constant sequence all of whose terms are $$q$$.
Now we have to define the sum and product of two real numbers. This is easy enough: if $$x_n$$ represents the real number $$x$$, and $$y_n$$ the real number $$y$$, then $x_n+y_n \text{ represents } x+y, \qquad x_n y_n \text{ represents } xy$ It takes some effort, but we can again show that all the algebra keeps working as before. We can add, subtract, multiply and divide (except by zero), and all the standard algebraic properties still hold.
What do we get from this?
For example, if $$m$$ is a natural number, we can define a sequence $$x_n$$ by $\begin{split} x_0 &= 1\\ x_{n+1} &= \frac{x_n}{2}+\frac{m}{2x_n} \mbox{ if } n \gt 0 \end{split}$ This sequence satisfies Cauchy's criterion, and if it has a limit, $$x$$, that limit satisfies $$x^2=m$$; so we certainly get square roots for all the natural numbers.
In fact, we get a lot more new numbers besides these.
Order is also simple to define: $$x \lt y$$ if for all $$n$$ sufficiently large, $$x_n \lt y_n$$. And again, it takes some effort, but we can check that the order still works in the same way.
We're not done yet. This makes sure that every sequence of rationals that looks as if it should converge has a limit. When the limit is not itself a rational number, we call it irrational.
But there's a potential problem here. We've got a way of adding in to the number system all the limits of Cauchy sequences of rational numbers which don't have a rational limit. But maybe there are still gaps. Maybe there are Cauchy sequences of the real numbers we've just constructed that don't have a limit (in this set). How often do we have to repeat this process? Could it even be that no matter how often we do it, there are still Cauchy sequences that don't converge in the set of numbers we've created so far?
The good news is that this doesn't happen. Having carried out this process once, we don't have to do it again. The real numbers are complete in the sense that every sequence that looks convergent (i.e. every Cauchy sequence) actually does have a limit in the real numbers.

# The end of the road

And now we have a set that we can feel reasonably happy to call real numbers. They can be added, subtracted, multiplied and divided (except by zero). The rational numbers, the integers, and the natural numbers all sit inside them in what seems like a sensible way. They are ordered in a way that talks nicely with the algebra. And finally, there are no gaps.
Unfortunately, the actual objects are very unwieldy. A real number is an equivalence class of Cauchy sequences of equivalence classes of rationals, rationals are equivalence classes of pairs of integers, integers are equivalence classes of pairs of natural numbers, and natural numbers are those sets we call von Neumann numbers. This is awful. How are we supposed to do anything with objects like that?
It would make our lives much easier if we had a result like the one for the Peano axioms: that any two sets satisfying those axioms are really the same set but differently labelled. Unfortunately that isn't true for the properties we have so far.
But it's nearly true. The structure we've created has the property that if $$r \in \mathbb{R}$$, then there is some $$n \in \mathbb{N}$$ such that $$n \gt r$$. This is called the Archimedean property. And if we put that into the mix, then we do have the same uniqueness result as before. So the good news is that as with the natural numbers, we don't really care about the construction: what it does is to demonstrate that the algebraic, order, and completeness properties we want are satisfied by essentially one object. And now we can just work with the rules, without caring about what is 'inside' our real numbers: how they behave is all that matters, and we can describe that without caring about the insides.

# Odds and Ends

## Decimal representation

So, what does this have to do with the usual representation of a real number as in integer plus a (possibly neverending) decimal part?
We define $$m.d_1 d_2 d_3 \ldots$$ to be the real number defined by the Cauchy sequence whose $$n$$th term is $m + \sum_{i=1}^N \frac{d_i}{10^i}$ In fact, this is one way of understanding what we mean by $m+\lim_{N \to \infty} \sum_{i=1}^N\frac{d_i}{10^i}.$ Not only does this always give a Cauchy sequence (relatively easy to prove), but for any real number there is a sequence of this form (not quite so easy). So the real numbers actually are the familiar decimal objects.
As an added bonus, now you can really understand why $$0.\dot{9}=1$$: the Cauchy sequence $$0.9$$, $$0.99$$, $$0.999 \ldots$$ is in the same equivalence class as the constant sequence $$1,1,1\ldots$$, so the two are just different representations of the same real number, namely $$1$$.

## Levels of irrationality

There are two different types of irrational number, both of which are produced by this single construction. We say that a number is algebraic if it a root of a polynomial equation with integer coefficients. So all rational numbers are algebraic, and so also is $$\sqrt{m}$$ for any natural number $$m$$.
But there are also irrational numbers which are not the root of any such polynomial: $$\pi$$ and $$e$$ are the best known of these. These are called transcendental. Trying to understand these is the object of transcendental number theory. One surprising property of the transcendental numbers is that they are easier to approximate by rational numbers than the irrational algebraic numbers. (Easy to approximate has to do with how close you can get with a denominator of a given size.)

# Alternative Routes

There's more than one way to this endpoint.
We could start with the strictly positive integers, and proceed via the positive rationals to the set of all rationals. This is slightly more elegant in that we don't have to worry about the denominator being zero when we build the rationals. On the other hand, we don't get the integers as a natural subset of the rationals this way.
Once we have the rationals, there are various ways of filling in the gaps, by using different characterizations of just what the gaps are, two of which are particularly worth mentions.
One is to consider ways of splitting up the rationals into sets $$L$$ and $$U$$ so that $$L$$ and $$U$$ between them contain all rationals, every element of $$L$$ is less than every element of $$U$$, and $$L$$ has no greatest element. Then we introduce the irrational numbers as least elements of those sets $$U$$ which do not have a least element. This construction is known as the Dedekind cut after the late nineteenth and early twentieth century mathematician Richard Dedekind, and the idea is based on that of Eudoxus, a 4th Century BCE Greek mathematician.
The other is to say that any set which has an upper bound (i.e. a number greater than every element of the set) has a least upper bound. The rationals don't have this property, but if we add in the required least upper bounds this has the same effect of introducing all the required irrationals.
But again, the important thing is that whichever route we choose, we arrive at the same endpoint. The details of the objects we finally end up with will differ, but since the properties are always the same, we don't need to know just what they are. How they behave really is all that matters.

# Keep right on past the end of the road

This has been quite a journey. We've filled in two algebraic gaps and one analytic one. But we don't actually have to stop here. There are still some missing objects. We can't, for example, solve the equation $$x^2+1=0$$ in $$\mathbb{R}$$. And there are no gaps between the real numbers to put the solution in. We can fill in the gap by using a different kind of trick, and introducing the complex numbers. This almost works, but this time as well as gaining something, you lose something: there is no way of ordering the complex numbers in a way that plays nicely with the algebra. But that's another story.

### Acknowledgement

Thanks to Colin Wright (@ColinTheMathmo) for helpful comments on an earlier version of this and @BTNMathsJam for typo spotting.

## Thursday, January 18, 2018

### What's a number? 1 - naturally

In mathematics (and before that, arithmetic) we get used to dealing with numbers in various ways. But what is a number?

# What sort of a question is that?

Of course, we all learn to count, and for the most part we learn to become competent. In practice, what this means is that there is an agreed-on sequences of noises that we make when we point at objects, and the number of objects is the noise we make when we get to the last one. And with practice, we find that if we're careful we always make the same noise for the last item of a collection, even if you point at them in a different order. We also find that if you have a pair of collections then they can be matched up in a one to one fashion if and only if we assign the same noise to both of them.
With a little thought, we might come to the conclusion that, for example, three is the name of a number, and any collection that we get to three on by the 'point-and-make noises' algorithm contains three items. But just what is this three? It's the thing that all collections with three items have in common, but what is this 'three-ness' that they all share?
Not knowing just what a number is doesn't stop us learning how to do arithmetic, and this arithmetic is generally very useful. It prevents us from being cheated too easily in shops, and ensures that we know how many minibuses to book for a trip to the seaside, or how with more effort to direct a spaceship to the moon. But a chosen few eventually get uncomfortable, and start asking awkward questions about just what numbers are.
One rather cunning attempt to deal with the problem is to think about sets, and then to define three to be the set of all sets with three elements. Then to say that a set has three elements is to say that it is in this set (of sets).
Alas, that doesn't work. When you try to think about sets carefully, you discover that you have to be a bit careful about just which descriptions actually describe sets. And when you do that, you find that 'has three elements' doesn't describe one, because if you try to include it as a set you get contradictions: this is a Bad Thing. This was a bit of a blow to early developers of set theory, so another solution has to be found.

# Who cares what a number is?

An alternative, and even more cunning attempt to deal with the problem is to ignore it. After all, humanity has been counting successfully for millennia without ever coming to any grief caused by not knowing what a number is. Surely all that matters is that, whatever they actually are, we know how they behave. So the solution to the problem is that if we want to do mathematics rather than just indulge in unimportant activities like making sure we get the right change when we buy a bar of chocolate, then we describe how numbers behave by means of some mathematical rules, and then work with the rules.
Well, time to be a bit more specific. I'm talking about the numbers we use to count things and do elementary arithmetic with here, so I'm talking about what are sometimes called the natural numbers, and I'll include in that the special number zero. (Some people like to start the natural numbers at one, but I'm not one of them.)
It's important to realise that there is no right way of doing this. We have to sit down and think about how numbers behave, and try to capture that behaviour is a neat collection of rules.
So we play with numbers (whatever they are) or rather some kind of representatives of them, and we come to various conclusions. We can add them, we can multiply them, order doesn't matter with addition and multiplication, and so on. Observations like this can be gathered indefinitely, so we need to pick out a collection of fundamental properties that the others follow from: an axiom system for the natural numbers.
One way of doing this is due to Giuseppe Peano, who wrote a book about it in 1889 (and in Latin: this was a time when by writing in Latin you might still hope to increase your audience rather than shrinking it). Nowadays, we tend to present things rather differently, but nevertheless Peano's axioms survive in somewhat altered form. So, here is one version of the Peano axioms for the natural numbers, consisting of a set $$\mathbb{N}$$ with a successor function $$S:\mathbb{N} \to \mathbb{N}$$:
1. $$0 \in \mathbb{N}$$
2. If $$n \in \mathbb{N}$$, then $$0 \neq S(n)$$
3. If $$m \neq n \in \mathbb{N}$$, then $$S(m) \neq S(n)$$
4. If $$K \subseteq \mathbb{N}$$ such that $$0 \in K$$ and $$n \in K \Rightarrow S(n) \in K$$, then $$K = \mathbb{N}$$
We will also use the convenient shorthand $$S(0)=1$$, $$S(S(0))=S(1)=2$$ and so on, with the familiar numerals; privately we think of $$S$$ as adding $$1$$.
This certainly seems to capture some of how we think the natural numbers behave. $$0$$ is the first one, we get all the rest by adding $$1$$, and we never get the same answer by adding $$1$$ to two different numbers. The last is less obvious, but it's just a slightly disguised way of saying 'proof by induction works'.
Now comes the surprise (at least, it surprises me).
These axioms essentially specify the natural numbers uniquely.
By that I mean that if we have another set, $$\mathbb{N}'$$ with an element $$0'$$ and a successor function $$S'$$ satisfying the same axioms, then there is a (unique!) bijection $$i:\mathbb{N} \to \mathbb{N}'$$ such that $$i(0)=0'$$, and $$S(i(0))=i(S(0))$$. So any two sets claiming to be the natural numbers are really just the same set with two different ways of labelling the elements and successor function.
Hang on, though (you may be thinking). It these things are meant to be the natural numbers, shouldn't there be operations of addition and multiplication?
Indeed there should. And they should have all the usual properties too of associativity, commutativity and distribution. But now we know that if we can define them in terms of $$0$$ and the successor function, then the natural numbers are still basically unique. (And we don't have to come check the properties of addition and multiplication work separately.) So, here's how addition and multiplication work for $$m,n \in \mathbb{N}$$, using $$+$$ for addition and $$\cdot$$ for multiplication:
1. $$m+0 = m$$
2. $$m+S(n) = S(m+n)$$
Multiplication
1. $$m \cdot 0 = 0$$
2. $$m \cdot S(n) = m \cdot n+m$$
These definitions are both inductive: they tell us how to combine a number with $$0$$, and if we have to combine a number with something other than $$0$$ we reduce it to dealing with a number nearer $$0$$.
I wouldn't like to claim that it is trivial, but it is straightforward to show that addition and multiplication as defined are associative and commutative and that the distributive laws hold.

# That's all very well, but are there any numbers?

And here's the snag. It's all very well saying that we don't care what they are, only how they behave. But how do we know that anything behaves like that?
Remember, we have a bunch of properties that we have abstracted from experience. For the numbers we're directly familiar with (say, up to a dozen or so) we have a fairly direct perception of the truth of the various properties. But why do we believe that they keep on going for numbers which are arbitrarily large? Why do we even believe that the mathematical structure makes sense? How do we know that the axioms are consistent? It could be that we have asked for too many properties, and defined the natural numbers out of existence.
Aside: probably apocryphal stories about PhD students who have proved hundreds of interesting properties of things that don't actually exist abound. We really want to avoid that situation.
In fact, asking about consistency of axioms open an enormous can of worms, which I don't propose to get into right now. I'll settle for a less rigorous approach that is at least very (very very) plausible, and which was invented by the amazing John von Neumann.
First, we assume that set theory is consistent. We don't even need scary bits of set theory, we just assume that no problems arise with some simple sets that I'll build as we go along.
First of all, we have the empty set, $$\emptyset$$. This will be $$0$$ in the model I'm going to describe for Peano's axioms. (It's a model in a highly technical sense too, but I also don't propose to get into that.)
Then the first few natural numbers are given by $\begin{split} 0 &= \emptyset \\ 1 &= S(0)= 0 \cup \{0\} = \{0\} \\ 2 &= S(1)=1 \cup \{1\} = \{0,1\} \\ 3 &= S(2)= 2 \cup \{2\} = \{0,1,2\} \end{split}$ and in general $n+1 = S(n) = n\cup \{n\} = \{0,1,\ldots n\}$ with $\mathbb{N} = \{0,1,2,3\ldots \}$
You might wonder, as I did when I first met this, why we don't just use $$S(n)=\{n\}$$. That's simpler. Well, there's nothing to stop you doing that, but it's less elegant: for a start, in this picture the set $$4$$ has 4 elements, which is handy. In fact, there are other nice properties of these integers, such as the fact that $$m \lt n$$ is equivalent to $$m \in n$$, which has some huge ramifications. Those are for maybe a future post. For now, we settle for having a set and successor function that look as if they work.

# So the natural numbers are these weird sets?

Fortunately not. These von Neumann numbers (often called the von Neumann integers, even though they don't include the negative ones) are just to show that the axioms make sense. The point of them is to say that the axiomatization makes sense, so it is in fact safe not to worry about what numbers are, just how they behave.
It's worth noting that there are models of the Peano axioms that look completely different. The fact that there is a bijection that preserves the structure doesn't mean that the objects have to bear any resemblance to each other.
So here's an entirely different approach. We consider a world of sets and functions, where the domain and codomain of each function is the same. I mean here if I look at any function, it maps some function to itself: different functions can do this with different sets. Then my integers are operators on the collection of functions that are built up like this:
1. $$0(f)$$ is the identity function.
2. $$S(n)(f)$$ is $$f \circ n(f)$$
This is the idea of Alonzo Church, and can be expressed neatly using his lambda calculus . (In fact the lambda calculus gives a nice way of expressing addition and multiplication in this model of the integers.)
The point is that these integers then are nothing like the von Neumann integers. They are instructions that tell you how often to compose a function with itself. But they have exactly the same structure as the von Neuman integers.
We could even make a model using strings of characters using the alphabet $$\{0,1,2,3,4,5,6,7,8,9\}$$ where everything is defined using the usual rules for decimal arithmetic. (Yuck.)
And none of these is really the natural numbers. Or they all are, because they're all really the same, we're just expressing one ideas in several different way. The point is that the axiomatic approach tells us that it doesn't matter how we think about the natural numbers, we can reason about them just by using the axioms, secure in the knowledge that whatever we deduce will be true of whatever concrete realization we prefer.

# Are you glossing over anything?

Lots. One is the the subtlety of the inductive axiom. I gave it as involving an arbitrary subset of $$\mathbb{N}$$. But if we do logic (and sets) carefully, this is not part of first-order logic. We'd have to say something like
If $$P$$ is any predicates such that $$P(0)$$ is true and $$P(k) \Rightarrow P(k+1)$$ for al $$k \geq 0$$ then $$P(n)$$ is true for all $$n \in \mathbb{N}$$.
This doesn't look like such a big deal. And, after all, it's how we use it.
But it's an enormous deal. If you use this as your axiom, then the axioms are not strong enough to tie down the natural numbers to essentially one version. You can have non-standard models, which included numbers larger than all the normal integers. This is also tied up with non-standard analysis.
I've also glossed over the whole of (axiomatic) set theory, the attempt to put the idea of a set on a reliable footing so that it can't be broken by something like Russell's paradox. It's a fascinating topic in its own right, and a study of it explains why you get into trouble by trying to have a set of all sets with three elements. Interesting as it is, though, we don't need to go to all that effort to appreciate why the von Neumann numbers give a nice model.

# Hang on, $$0$$ isn't a natural number! This is all wrong!

It is true, not everybody includes $$0$$ as a natural number: some only consider the strictly positive integers to be natural numbers. It's one of mathematics's little ironies that there is no universally agreed definition of a natural number. Neither definition is clearly better than the other: each has its minor context dependent advantages and disadvantages.
If I had to draw up approximate lines, I'd probably say that in general terms logicians and set theorists tend to include $$0$$, number theorists tend not to. And I will admit now (to save anybody the effort of pointing it out later) that Peano started at $$1$$, not $$0$$.
I've included $$0$$ because it's what I'm used to, and because the von Neumann integers are such a compelling model, and it has a natural place there. I also like there being an additive identity. If you really can't bear the thought if $$0$$ being a natural number (and if you can't, I can't imagine what compelled you to get this far) then just consider this whole thing as am discussion of what the non-negative integers are.

# Why does the title have a 1 in it?

Now that we know what the natural numbers are (or aren't) we can build on this to make other number systems. I intend to explore a couple of them (eventually) in subsequent blog posts.

## Friday, December 29, 2017

### Measuring up to Heisenberg (and since)

'Everybody knows' that the uncertainty principle of quantum mechanics tells us that we can't make a measurement on a system without disturbing it, and that that's why you can't simultaneously measure the position and momentum of a particle. This isn't an entirely unreasonable thing to think: it is, after all, closely related to Heisenberg's original argument involving using light to probe the properties of a particle. But it's wrong. The situation in quantum mechanics is subtler and much more interesting than that.

# Statistics rears its ugly head

Suppose we have a large collection of boxes, each containing a particle, and they have been prepared so that the particle in every box is in exactly the same state. (Or if you prefer, one box containing a particle, prepared in the same way a large number of times.)
You may be wondering how we do that. I'm not going to tell you, largely because I have no idea. But I can tell you what quantum mechanics predicts about the measurements we make on the particles inside. You may think this is unfair, but it's no less fair than using classical mechanics to tell you what happens to a ball thrown at 20 meters per second at an upward angle of 30 degrees from a cliff edge 100 meters above sea level: classical mechanics doesn't tell us how to arrange that situation either, only what the consequences are.
Now we measure the $$x$$-coordinate of the position of each particle. It could be that every single box is guaranteed to give the same value for this measurement. In this case we say that the particle is in an eigenstate of $$x$$-position, and the eigenvalue is the $$x$$ value that we measure.
But this is not a very likely occurrence. It is much more likely that the measurements result in different values. Then we can compute the average value of all the measurements, and if we have the enthusiasm and stamina, we can compute the standard deviation. Doing this lots of times with differently prepared particles, we find that sometimes the standard deviation is large, and sometimes it is small: so sometimes the particle has a relatively well-defined position (in the sense that the measurements tend to be close together), and sometimes not.
Instead of measuring the $$x$$-coordinate of the position of each particle, we might measure the $$x$$-component of its momentum. A similar situation results. Depending on the initial preparation, we sometimes find a large spread of results, and sometimes a small spread.
There is an obvious question we might ask: what is the relationship between the measurements of position and momentum? The answer is simple to describe, but not quite so easy to explain.
It turns out that we can prepare the particles so that they have a very small standard deviation in their position measurements. We can also prepare them so that they have a very small standard deviation in their momentum measurements. But we can't prepare them in a way where both sets of measurements have a small standard deviation. In fact, the product of the two standard deviations cannot be reduced below a certain value. The more tightly clustered the position measurements are, the more spread out are the momentum measurements; conversely, if the momentum measurements are tightly clustered, the position ones are more spread out.
This is not easy to understand from the point of view that the act of measurement disturbs the system. All the measurement are being carried out on particles in the same state, but some measure position and others measure momentum. So what is going on here?

# What is a particle's state?

This is one way in which quantum mechanics is really different from classical mechanics.
In classical mechanics, the state of a particle is described completely by its position $$\pmb{x}$$ and momentum $$\pmb{p}$$. Given these quantities and the knowledge of the forces acting on the particle, the subsequent values of position and momentum are entirely determined.
The quantum mechanics picture is rather different.

## Observables

In quantum mechanics, the state of a system is described by a vector, often denoted $$\psi$$. So far, this does not sound so different from the classical mechanics situation: we can think of the position and momentum as a vector too. But it's a different kind of vector.
In the quantum mechanics picture of the universe, each quantity that we can observe is associated to a linear operator (in fact, a special kind of linear operator, called Hermitian, but I won't go into that level of detail here) I'll use $$H$$ to represent a (Hermitian) linear operator. What this means is that $$H$$ is a machine that I can feed a vector and it returns a vector in a well-behaved way: if $$\alpha, \beta$$ are any two complex numbers, and $$\psi,\phi$$ are any two vectors, then $H(\alpha \psi + \beta \phi) = \alpha H(\psi) + \beta H(\phi)$
Although I slid past it fairly quickly just then, you may have noticed that I said $$\alpha$$ and $$\beta$$ could be complex numbers: this is one of the ways that quantum mechanics is different from classical mechanics---the mathematics of complex numbers is there at the fundamental level of the kind of vectors that can be used to describe the state of a system.
Now, if we choose our state vector very carefully, we can have the situation $H(\psi) = \lambda \psi$ where $$\lambda$$ is a real number. When this happens, $$\psi$$ is called an eigenvector of $$H$$, and $$\lambda$$ is the associated eigenvalue. (For general linear operators the eigenvalues can be complex, but for Hermitian operators they must be real.)
Eigenvectors are very important in this story. I'll explain why shortly, but first, we need to know this: If $$H$$ is the linear operator corresponding to an observable, then there are eigenvectors $$\psi_i$$ of $$H$$ such that any state vector $$\psi$$ can be expressed in the form $\psi = \sum_{i} \alpha_i \psi_i$ in just one way, where $$\sum_i |\alpha_i|^2 = 1$$. (We call this set $$\{\psi_i\}$$ the basis of eigenvectors associated with $$H$$).

## Observations

'Just what,' you should be wondering by now, 'does this have to do with measuring a property of a particle?'
Well, given an observable, $$H$$, (note that I'm abusing terminology a bit here by referring to the linear operator as the observable: I do that) and its associated basis, $$\{\psi_i\}$$, we know that each $H (\psi_i) = \lambda_i \psi_i$ for some real number $$\lambda_i$$.
Then if we have a particle in the state $\psi = \sum_{i} \alpha_i \psi_i$ We can carry out a measurement of the observable corresponding to $$H$$.
The result of the measurement must be one of the values $$\lambda_i$$, and a result of $$\lambda_i$$ occurs with probability $$|\alpha_i|^2$$. (So we don't know just what the result will be, just the probabilities of the various possible results.) Furthermore, immediately after the measurement, if the measured value is $$\lambda_K$$, then the state of the system is $$\psi_K$$. (This is called the collapse of the wave function.)
Quantum mechanics does not tell us which outcome will occur, it only tells us the probabilities of the various outcomes.
So for example, if the state of the system were $$\psi = 0.6 \psi_1 + 0.8 \psi_2$$, then the result of measuring $$H$$ would be $$\lambda_1$$ $$0.36$$ of the time (and immediately after these measurements the system would be in state $$\psi_1$$), and it would be $$\lambda_2$$ the remaining $$0.64$$ of the time (and immediately after those measurements the system would be in state $$\psi_2$$).
On the other hand, if the state of the system were $$\psi=\psi_1$$, then the result of measuring $$H$$ would always be $$\lambda_1$$, and the state of the system would be unchanged by the measurement. (This, of course, contradicts the claim that it is impossible to make a measurement without disturbing the system.)
So we can see that (given this model of particle states and measurements), the average and spread of measurements is determined by the coefficients $$\alpha_i$$; if most of the weight is associated with a particular basis vector, then the results have a very small spread, and if many basis vectors make a significant contribution, then the results are more spread out. |

# Uncertainty

So now we can try to see what this might suggest about measurements of position and momentum. Let's call the operator corresponding to measurement of $$x$$-position $$X$$, with corresponding basis $$\{\psi_i\}$$ and the operator for the momentum measurement, $$P$$, with corresponding basis $$\{\phi_i\}$$.
The crux of the situation is that the $$\psi_i$$ cannot be eigenvectors of $$P$$, and the $$\phi_i$$ cannot be eigenvectors of $$X$$.
So if a particle is prepared in a state which guarantees a particular value of $$x$$, so $$\psi = \psi_i$$, then it must be a combination of $$\phi_i$$; likewise, if $$\psi=\phi_i$$, it must be a combination of $$\psi_i$$. So certainty of position means that momentum is uncertain, and vice versa.

## What's going on here?

The problem is that the operators $$P$$ and $$X$$ do not commute: if $$\psi$$ is a state vector, then $$PX(\psi) \neq XP(\psi)$$. The mathematics of Hermitian operators tells us that it is possible to find a basis of vectors which are eigenvectors for a pair of $$H_1$$ and $$H_2$$ if and only if the two commute, i.e. $H_1H_2 (\psi) = H_2H_1(\psi)$ for all vectors $$psi$$.
We can define the operator $$[H_1,H_2]=H_1H_2 - H_2H_1$$, called the commutator of $$H_1$$ and $$H_2$$, and then we can equivalently say that $$H_1$$ and $$H_2$$ commute if $[H_1,H_2] = 0$ (meaning that $$[H_1H_2](\psi =0)$$ for all $$\psi$$.
It then follows that since $$P$$ and $$X$$ do not commute, one cannot have simultaneously sharply defined values for both the $$x$$-position and the $$x$$-component of momentum for a particle.
On the other hand, if you find two observable which do commute, then it is possible to have sharply defined values for both; indeed, either both will be sharply defined or neither will!

## Heisenberg's uncertainty relation

Heisenberg's uncertainty relation (as we now understand it) tells us how the extent to which two operators fail to commute tells the extent to which values of the corresponding observable cannot be simultaneously well-defined. More precisely, given a state $$\psi$$, and observables $$A$$, and $$B$$ the product of the standard deviations of measurements of $$A$$ and $$B$$ must be at least half the length of the vector $$[H_1,H_2](\psi)$$.
In the case of $$x$$-coordinate of position and $$x$$-component of momentum, the length of $$[X,P](\psi)$$ is always $$h/2\pi$$, given the well-known uncertainty relation.

# Just a minute $$\ldots$$

We've actually shifted ground in a fairly radical way here. We've gone from 'you can't measure a particle's position and momentum simultaneously' which comes from a world-view in which it has a position and momentum, but we can't access the values', to 'a particle's state does not simultaneously specify a sharp position and momentum' which comes from a world-view where the state of a particle just isn't the same information as it is in classical physics.
This, together with the fact that the theory is probabilistic rather than deterministic has been a significant issue in the development of quantum mechanics. The desire to maintain a 'realist', deterministic picture of physics, in which particles do have well-defined position and momentum even if we can't know them survives, and even now has its proponents.
And all this doesn't actually explain the uncertainty principle. It just provides a mathematical/theoretical framework which it is a consequence of. But that's OK, because that's generally how physics proceeds: we have phenomena that are hard to understand, and we develop a structure in which those phenomena are natural. Once we've become thoroughly accustomed to that structure, we tell ourselves we understand what's going on. At least, until the next Big Idea comes along and we start all over again.

If this has whetted your appetite for more detail, there are, of course, many places to go to. This is a short list of 'not a standard undergraduate textbook' books which I think are interesting.
• Susskind and Friedman Quantum mechanics: the theoretical mininum gives a careful presentation that doesn't require a huge amount of mathematical backjgound. The first five chapters fill in most of what I've glossed over above, and then it goes on to look at other interesting stuff such as entanglement. It also explains how the quantum state evolves in time, which I haven't mentioned at all.
• Penrose's The Emperor's New Mind (and his other more recent tomes) give wonderful insights into quantum mechanics (inter alia), including some speculative ideas about wave function collapse.
• Bohm and Hiley's The undivided universe is an exposition of pilot wave theory, which provides a deterministic and realist model of quantum mechanics. I haven't invested the time and effort to be able to do more than say: this looks like a good place to start if you want to know more.
• Feynman's lecture notes are, of course, worth looking at once you have some idea what's going on.
You may have noticed that I haven't listed any resources on the philosophy of quantum mechanics. This is because I know my limitations.

## Friday, November 17, 2017

### Russian peasant cryptography

What is the relationship between Russian peasant multiplication and modern cryptography? Unlike the multiplication procedures commonly taught, it can easily be adapted to compute powers rather than products; and we need to calculate powers for some cryptosystems.
I should probably admit right away that this is not going to be at all informative about how Russian peasants managed keep their communications secret from their aristocratic overlords.

# Russian Peasant Multiplication

If you haven't seen this before, it's a rather neat approach to multiplication which doesn't require the memorization of multiplication tables, but only the ability to double or half integers. It seems to be rather unclear just what it has do do with Russian peasants: it's certainly less than Danish pastries have to do with Denmark. The method was known to the ancient Egyptians, who considerably predate Russian peasants.
The general description is simple, but not very informative. Nevertheless, I will start by presenting it.
If you want to multiply together two integers, $$m$$ and $$n$$, form two columns, one headed by $$m$$ and the other by $$n$$. Then repeatedly add a new row obtained by halving $$m$$ (discarding any remainder) and doubling $$n$$, stopping when you end up with a $$1$$ in the $$m$$ column. Finally, add up all the entries in the $$n$$ column where the $$m$$ value is odd.
It's much more informative to see an example. So let's work out $$13 \times 7$$. This gives us the table $\begin{array}{|c|c|} \hline m & n \\ \hline 13 & 7\\ 6 & 14\\ 3 & 28\\ 1 & 56\\ \hline \end{array}$ after which we add up the right hand entries when the left hand entry is odd, giving $7+28+56 = 91$ which we can quickly check is $$13 \times 7$$.
The interesting question is, of course, why does this mysterious looking procedure work?
If we look at the left hand column, and work our way up, recording a $$1$$ when we see an odd number, and a $$0$$ when we see an even one, we obtain $$1101$$, which is (by no coincidence whatever) the binary representation of $$13$$. In fact, this is exactly the standard algorithm for converting numbers to their binary representation.
Then we can see that the repeated doubling of the $$m$$ column is calculating the product of $$m$$ with the corresponding power of $$2$$, so that adding those corresponding to an odd value in the $$n$$ column is adding up $$m$$ multiplied by those powers of $$2$$ which sum to $$m$$.
So by using this we can multiply any integers together, and what's more the algorithm is reasonably efficient. I admit that it's not as efficient as the column or grid methods usually taught now, but at least you don't need to know your times tables to do it.
This is rather nice. At least, I like it. But rather remarkably, it can be adapted slightly to work out something apparently much more complicated, and very useful in modern (public key) cryptography.

# Russian peasant exponentiation

We make two changes to the above algorithm above, both only involving the $$m$$ column.
• Instead of doubling, we square

Then in each row of the table, instead of having $$m$$ multiplied by the corresponding power of $$2$$, we have it raised to the corresponding power of $$2$$. Multiplying all these together then gives us $$n^m$$.
Let's see how it works with the previous $$m$$ and $$n$$. $\begin{array}{|c|c|} \hline m & n \\ \hline 13 & 7\\ 6 & 49\\ 3 & 2401\\ 1 & 5764801\\ \hline \end{array}$ and multiplying the relevant terms together gives $5764801 \times 2401 \times 7 = 96889010407$ which is, indeed, $$7^{13}$$.
But unlike the multiplication, this isn't just reasonably efficient: this is remarkably efficient.
If we want to find $$n$$ to the power of $$m$$, we could do it by multiplying together $$m$$ lots of $$n$$. This is like finding $$m \times n$$ by adding together $$m$$ lots of $$n$$: you can do it, but if $$m$$ is bigger than about $$3$$ you really have better things to do with your time.
But with this algorithm, the number of squarings that you have to do is at most about $$3$$ times the number of digits of $$n$$ (since the binary representation of a decimal number has about $$3$$ times as many bits as the decimal one has digits), followed by that many multiplications.
Of course, we need to be able to multiply to do this, but that's OK: the multiplication algorithm lets us multiply by doubling and adding, and now the adapted version lets us exponentiate by squaring and multiplying. It's still a good deal.

# A quick outline of RSA

The point of this is that in the RSA (Rivest-Shamir-Adleman) approach to public key cryptography (and also in some others), we need to compute powers in order to encrypt and decrypt messages.
The way that the procedure works is that you choose a private pair of primes, $$p$$ and $$q$$, and tell the world the value of $$N=pq$$. (You keep $$p$$ and $$q$$ secret: the algorithm relies on the fact that factorizing $$N$$ is hard.)
Then you pick a number $$e$$ and find $$d$$ with the property that $$ed$$ is $$1$$ more than a multiple of $$(p-1)(q-1)$$. You tell the whole world $$e$$. (You keep $$d$$ secret. The algorithm relies on the fact that finding $$d$$ without knowing $$p$$ and $$q$$ is hard.)
Then you represent your message as an integer $$M \lt N$$. The encrypted form of your message, $$E$$ is the remainder when you divide $$M^e$$ by $$N$$.
Then by the magic of number theory (actually, it isn't magic, it's Fermat's Little Theorem) the original message is the remainder when you divide $$E^d$$ by $$N$$.
In principle, this is all there is to it, and there are many descriptions available, often illustrated using values of $$N$$ which are small enough to make all the arithmetic tractable.
But there's an issue.
In practice, the values of $$M$$ and at least one of $$e$$ or $$d$$ is very large. If you had to do repeated multiplication, the universe would die of old age before you were finished. This algorithm reduces the amount of work to manageable proportions. $$e$$ (or $$d$$) multiplications are reduced to a number of multiplications that is a small multiple of the number of digits of $$e$$ (or $$d$$).
That wasn't the only issue.
The value of $$M^N$$ is almost inconceivably large. You couldn't write it down if you used every elementary particle in the known universe to hold one digit. So how does that work?
The answer is more number theory. One of the really nice properties of working out remainders is that it doesn't matter where in a calculation you do it. If you want to know the remainder of some quantity when you divide by $$N$$, then you can calculate the whole thing then divide by $$N$$; or, rather more sensibly, you can work it out in stages and take the remainder whenever the working value exceeds $$N$$. You are guaranteed to obtain the same result in the end.

# Russian Peasant RSA

So now we can see how to compute an encrypted message. We can use the adapted Russian peasant multiplication, but now we just add the final detail that whenever the squared number exceeds $$N$$, we take the remainder on division by $$N$$.
Let's see this work with finding the remainder on dividing $$7^{13}$$ by $$59$$.
$\begin{array}{|c|c|} \hline m & n \\ \hline 13 & 7\\ 6 & 49\\ 3 & 2401 \rightarrow 41\\ 1 & 1681 \rightarrow 29\\ \hline \end{array}$ giving $29 \times 41 \times 7 = 8323 \rightarrow 4$ which is (phew!) the remainder on dividing $$96889010407$$ by $$59$$.
But here's the punchline. This algorithm for calculating the exponential is what is usually referred to as exponentiation by (repeated) squaring. And this (or some variation on it) is in fact the algorithm used in implementations of public key cryptography (such as RSA) which rely on computing exponentials.
So there we have it. What is essentially an algorithm for multiplication which can be used by people who don't know their times tables turns into an algorithm for exponentiation which underlies contemporary cryptographic methods.

## Acknowledgements

At the 2017 MathsJam Gathering, Elaine Smith and Lynda Goldenberg gave a talk about techniques for multiplication including the Russian peasant method. That was when the penny dropped for me that the repeated squaring algorithm was an adaptation of the Russian peasant multiplication, and I decide it was worth sharing the realisation. About twenty minutes lates, Oliver Masters gave a talk about the Fibonacci matrix where he mentions the repeated squaring algorithm for matrix exponentiation. Fortunately he didn't relate it to the multiplication algorithm. About five minutes after that, Colin Wright (@ColinTheMathmo) was wrapping up the talks and he mentioned the relationship. But by that time I had the bit between my teeth and had decided to do this anyway, so I have.
If you enjoyed this, then you'll probably enjoy @ColinTheMathmo's series of articles about Perrin numbers.

## Thursday, October 26, 2017

### A minus times a minus is a plus

The reason why, we really should discuss.

So, why is the result of multiplying two negatives numbers together a positive number? There are various answers to that.
First, let's think a little about why we care.

# The rules of the (arithmetic) game

We start off with the non-negative integers $\mathbb{N} = \{0,1,2,3,\ldots\}$ which we all learn to manipulate from a very young age. We know how to add and multiply them, and (at least sometimes) to subtract, at least as long as the number we want to subtract is no bigger than the one we want to subtract it from.
Then somebody comes along and says that when you subtract $$2$$ from $$1$$, the answer is $$-1$$: a bit like $$2-1$$, but with a minus sign in front of it. This is a new kind of number called a negative number, and they're useful for situations where you have to subtract a bigger number from a smaller one, and in certain unpleasant circumstances to describe the amount of money you have in the bank.
But now we have to ask: when we do arithmetic with them, how do they work?
One solution is just to present some rules, including the rule that the product of two negative numbers is the same as the product of their positive counterparts. In other words:

# That's just the rule, isn't it?

Ooh, not at all satisfactory. The rule seems completely arbitrary. But did we really have any choice? Are there any constraints that determine this? What if we'd chosen the rule that the product of two negative numbers was negative?
Let's hold this thought for the moment and think about how $$\mathbb{N}$$ behaves.
So, let's take a careful look at the non-negative integers, and what we know about them. In fact, there is a bunch of stuff that we know. I'm too lazy to write in an explicit multiplication symbol, so I'll use $$mn$$ to represent the product of $$m$$ and $$n$$.
In the list below, $$m,n$$ and $$p$$ are arbitrary elements of $$\mathbb{N}$$. Then
1. $$m+n = n+m$$ (addition is commutative)
2. $$m+(n+p) = (m+n)+p$$ (addition is associative)
3. $$mn = nm$$ (multiplication is commutative)
4. $$m(np)= (mn)p$$ (multiplication is associative)
5. $$(m+n)p = mp+np$$ (multiplication distributes over addition)
6. $$0+n = n$$ ($$0$$ is the additive identity)
7. $$1n = n$$ ($$1$$ is the multiplicative identity)
8. If $$m+n = p+n$$ then $$m=p$$ (cancellation law for addition)
9. If $$mn=pn$$ and $$n \neq 0$$ then $$m=p$$ (cancellation law for multiplication)
There are lots of other rules which we could write down, but these are all familiar, and the others follow from them, so we'll take them as a useful working collection.
In particular, there's one very familiar one which is conspicuous by its absence. It is the rule $$0n = 0$$. I left it out on purpose, because we can show that it follows from the others, by the following argument:
We know that $$0+0=0$$, because $$0$$ is the additive identity. But that means that $$(0+0)n=0n$$, no matter what $$n$$ is. By the distributive law, then, $$0n+0n=0n$$, so that $$0n+0n=0n+0$$, and then by the cancellation law, $$0n=0$$.
There is an important lesson hiding in here. It is that our familiar rules aren't independent: once we have agreed on some of them, others are forced upon us. (So it's a good idea to choose a useful collection of rules to start with!)
This should raise an ugly possibility. What if my collection of rules isn't even consistent to begin with? How can I check that?
That's actually quite hard. Let's settle for the moment by accepting that there really is a set of non-negative integers with operations of addition and multiplication which satisfies the above rules. So there are lots of consequences from them that I haven't written above, but at least it all makes sense.
All this doesn't really help yet, but I can make use of it. Whatever the negative integers are, I want arithmetic to keep obeying the same rules as before. I want $$0$$ and $$1$$ to have their familiar properties, and I want the usual laws of algebra to continue to hold. So for every positive integer $$n$$ I introduce a negative integer $$-n$$ with the property that $$-n+n=0$$. The set of all integers, positive, zero, and negative, is denoted $$\mathbb{Z}$$. Now we can investigate what happens if we insist that all the rules above continue to hold.
Then $$\ldots$$

# It follows from the algebra

Once we have all the stuff above, it then follows by a fairly short argument that $$(-1)(-1)=1$$, if we insist that all the above rules continue to hold when we allow $$m,n$$ and $$p$$ to be negative.
First, we all agree that whatever we mean by $$-1$$, it has the property that $$-1+1=0$$. Then we know that $(-1+1)(-1) = 0(-1) = 0$ and so that $\begin{split} (-1+1)(-1)&=(-1)(-1)+1(-1)\\ &=(-1)(-1)+(-1)\\ &=0 \end{split}$ But now adding $$1$$ to each side gives $(-1)(-1)+(-1)+1=0+1$ so that $(-1)(-1)+ 0 = (-1)(-1)= 1$ So if we insist that all the algebraic rules for non-negative numbers continue to hold, we have no option. A minus times a minus is a plus.
Well, now we can see that if we decide that a minus times a minus is minus, then the ordinary rules of algebra must break somewhere. So this isn't really an arbitrary rule, it's forced upon us by the requirement that the algebraic properties of $$\mathbb{Z}$$ are the familiar algebraic properties of $$\mathbb{N}$$.
We can see various other things, too: one very important one is that $$(-1)n = -n$$. The argument is fairly similar: $\begin{split} (-1)n & =(-1)n + 0\\ &= (-1)n + n + (-n)\\ &= (-1)n+1n + (-n)\\ &=(-1+1)n + (-n)\\ &=0n + (-n) \\ &= -n \end{split}$
So, this tells us that if we extend the non-negative integers to include negative ones, and insist that algebra works as before, there are certain logical consequences, including the familiar a minus times a minus is a plus. If. But how do we know that these rules describe anything at all? What are these negative numbers that we have introduced, and how do we know that it is even possible to introduce negative numbers while preserving the algebraic rules?
You might be tempted to think I'm about to kick up a lot of dust and then complain about how hard it is to see, but when the dust settles the view is wonderful.
So here's the awkward question $$\ldots$$

# What is a negative number anyway?

Negative numbers are there to be the solutions to equations of the form $m+x = n$ when $$m>n$$. In particular, $$-n$$ is the solution to $$n+x=0$$, if this makes sense.
So, let's make sense of it.
We're going to do this a bit indirectly, thinking about pairs of integers $$(m,n)$$. Secretly we want to think of this pair as $$m-n$$, but we don't yet have a meaning for this when $$m \lt n$$.
So we're going to lump together certain pairs.
We say that the pairs $$(m,n)$$ and $$(p,q)$$ are equivalent if $$m+q=n+p$$. A collection of pairs which are all equivalent to each other is called an equivalence class.
This is actually quite a familiar idea: it's analogous to thinking of $$3/4$$ and $$6/8$$ as different ways of representing the same fraction. There are infinitely many pairs of integers $$(m,n)$$ which we can use to do it, namely all the ones satisfying $$3n=4m$$. We don't (often) talk explicitly about a fraction being an equivalence class of pairs of integers, but that's what's really going on.
We're going to do the same trick, but with pairs where it's the difference than matters, not the ratio.
We need to do three things now:
1. Define addition on these equivalence classes of pairs.
2. Define multiplication on these equivalence classes of pairs.
3. Show how we can think of these equivalence classes as the familiar integers.

This is the easy one.
We define addition on pairs by $(m,n)+(p,q)=(m+p,n+q)$
But we're not done yet, because the collection that the answer lies in should only depend on the collections of the original pairs, not on the specific pairs. We can check this.
So suppose that $$(m,n)$$ and $$(M,N)$$ are equivalent to each other, as are $$(p,q)$$ and $$(P,Q)$$. Then we need to check that $$(m+p,n+q)$$ is equivalent to $$(M+P,N+Q)$$.
First, we know that $m+N=n+M$ and $p+Q=q+P$ so adding these together we have $m+N+p+Q=n+M+q+P$ i.e. $(m+p)+(N+Q)=(n+q)+(M+P)$ which is exactly what we need.
So this operation of addition works in the sense that the equivalence class of the sum depends only on the equivalence classes of the pairs we add.
Multiplication is a bit tougher.

## Multiplication

This is not so obvious, so let's cheat a little. We are thinking of $$(m,n)$$ and $$m-n$$, and $$(p,q)$$ as $$p-q$$. Then the product ought to be $\begin{split} (m-n)(p-q) &= mp-mq-np+nq\\ & = (mp+nq)-(mq+np) \end{split}$ So we use the definition $(m,n)(p,q)= (mp+nq,mq+np)$ We can quickly check that $\begin{split} (p,q)(m,n) &= (pm+qn,qm+pn)\\ &=(mp+nq,mq+np)\\ &= (m,n)(p,q) \end{split}$ so that this is commutative, but now we have to check that the result is also independent of the choice of pairs.
So we consider $$(m,n)$$, $$(p,q)$$ and $$(P,Q)$$ where $$(p,q)$$ and $$(P,Q)$$ are equivalent. Then $(m,n)(p,q) = (mp+nq,mq+np)$ and $(m,n)(P,Q) = (mP+nQ,mQ+nP)$ and we see that $\begin{split} mp+nq+mQ+nP &= m(p+Q)+n(q+P)\\ &=m(P+q)+n(Q+p)\\ &=mP+nQ+mq+np \end{split}$ so that the two answers are equivalent.
Nearly there.
In just the same way, if $$(m,n)$$ is equivalent to $$(M,N)$$, then $$(m,n)(p,q)$$ is equivalent to $$(M,N)(p,q)$$. Putting this together, $$(m,n)(p,q)$$ is equivalent to $$(M,N)(p,q)$$, which is equivalent to $$(M,N)(P,Q)$$, so the equivalence class of the product only depends on the equivalence classes of the pairs we are multiplying.
Now we know that we have a well-defined arithmetic on our (equivalence classes of) pairs. But we still need to see how we can think of this as supplementing the non-negative integers with the negative integers.
We make use of the fact that we can choose any representative we like for each equivalence class. So we use the representative of the form $$(n,0)$$ for all pairs of the form $$(p+n,q)$$ where $$n \geq 0$$, and the representative of the form $$(0,n)$$ for the pairs of the form $$(p,q+n)$$ where $$n \gt 0$$.
Using this, we can now see that $$(m,0)+(n,0)=(m+n,0)$$ and $$(m,0)(n,0)=(mn,0)$$, so the non-negative integers live inside this strange new object.
We also see that $$(n,0)+(0,n)=(n,n)$$, which is equivalent to $$(0,0)$$. If we think of $$(n,0)$$ representing $$n$$, then $$(0,n)$$ represents $$-n$$.
It is then tedious but straightforward to check that all the rules of algebra for the non-negative integers still work. (This is mathematics code for 'I can't be bothered to write it all out, but I don't see any reason why you shouldn't suffer.')
So, now we know that is is, indeed, possible to add negative integers into the mix while preserving the well-known algebraic structure of the non-negative integers, and that the argument up above for why $$(-1)(-1)=1$$ does actually apply.
Of course, now that we have defined multiplication of our pairs of positive numbers, we can also do this explicitly and see the same answer from an entirely different perspective. Given than $$-1$$ is represented by $$(0,1)$$, we have $(0,1)(0,1)=(0+1,0+0)=(1,0)$ and $$(1,0)$$ represents $$1$$ in our model.

# What was the point of all that again?

After all this, you might wonder what the point of all that peculiar stuff about equivalence classes was. It just made a complicated picture of the positive and negative numbers, which everybody was happy with anyway. The snag is that you can't always just add a new kind of number into the mix and preserve the algebra that you like, but now we know that it can be done in this particular case of interest.
Maybe the most famous example of this is Hamilton's search for a three-dimensional version of complex numbers. You can start off with the real numbers, then include a new quantity $$i$$ which satisfies $$i^2=-1$$, and just do algebra with combinations of this $$i$$ and the real numbers, and it all works: complex numbers make sense, and we can think of their algebra as a way of adding and multiplying two-dimensional vectors.
Hamilton spent some time trying to add in another quantity like $$i$$ so get a three-dimensional version of algebra. In fact, that isn't possible. He eventually realised that it could almost be done in four dimensions, but you had to give up on the multiplication being commutative.
So there are problems with adding the new kind of numbers in and just assuming that the old rules of algebra will continue to hold. But now, after quite a lot of work, we can say with confidence that it is possible to introduce negative integers in a way which is compatible with the algebra of the non-negative ones. This is a Good Thing.

## Food for thought

1. I wrote down a multiplication rule which was motivated by me knowing how the answer should behave. Could I have chosen another rule with difference consequences?
2. I wrote down a bunch of rules above, which are followed by the non-negative integers. But are there other sets of quantities which satisfy those same rules? What happens if I try this trick on these other sets?

## Wednesday, August 2, 2017

### Schroedinger's cat: not one thing and the other.

Schroedinger's cat is not alive, is not dead, and is not simultaneously alive and dead. It's more interesting than that.

# Schroedinger's Cat

We're all familiar with the Schroedinger's cat (thought) experiment. A cat is placed in a box along with a sample of a deadly poison and a radioactive sample all hooked together in such a way that at a certain time the probability of the sample decaying, and releasing the poison resulting in a dead cat is exactly $$1/2$$. The question then is, what is the state of the cat at just that time? The standard understanding of quantum mechanics tells us that if we have a lot of identical boxes of this type and we look inside them all, then half of them will contain a live cat and the other half will contain a dead cat. But what is the state of the cat before the box is opened?

It is quite common to see statements to the effect that the cat is simultaneously dead and alive up until the moment the box is opened and it is observed. But observing the cat forces it into one state or the other, and so we never see a cat simultaneously alive and dead: but if we do the experiment many times, we expect just to see a live cat half of the time and a dead cat the other half.

There are a couple of questions which spring pretty irresistibly to mind. The first is, what on earth does it mean for the cat to be simultaneously dead and alive? And the second is, if it is in fact simultaneously dead and alive, why don't we ever see that?

## Decoherence-a digression

Let's address the second question first. One current understanding is that when the interior of the box interacts with the outside world, the state of the interior is forced into a well-defined classical state by a phenonemon known as decoherence. This gives a purely physical explanation for the 'collapse of the state', a central aspect of the Copenhagen interpretation of quantum mechanics, avoiding the earlier speculations about a conscious observer being required, and the consequent problems raised by Wigner's friend.

But this is not an easy question, nor one with a universally accepted answer. The issue has been vigorously debated for many years and will almost certainly continue to be for many to come. I am going to take the coward's way out, and avoid the question entirely. Instead, I will consider the notion of a state and what kind of state describes the cat in the box.

From now on, then, when I talk about 'observation', you can think about 'what I see when I look at the system' or 'the state of the system when decoherence due to interaction with the exterior world results in a classical state', as you prefer.

# Quantum states

In the quantum mechanics picture of the world, the state of a system is described by a vector: this contains all the information that there is about the system, and the results of any measurement that can be made. In (fairly) general, the state of a system can be written $|\psi\rangle = \sum_i \alpha_i | \psi_i \rangle$ where each of the $$|\psi_i\rangle$$ is a vector corresponding to a classical state of the system, such as 'alive' or 'dead' for the cat, or 'decayed' or 'not decayed' for the radioactive sample. The $$|\psi_i\rangle$$ have unit length and are orthogonal to each other, and the $$\alpha_i$$ are complex numbers whose squared moduli add up to $$1$$. Finally, the probability of the system being in the $$i$$th state when observed is $$|\alpha_i|^2$$.

The use of the notation $$| \;\; \rangle$$ to represent a state-vector is due to Dirac, and is part of a fairly awful pun he was very fond of. If you think of $$|\psi\rangle$$ as a column vector, then the corresponding row vector is $$\langle \psi |$$. Dirac called the row vector a bra, and the column vector a ket, so that when you take the inner product of two vectors $$|\psi\rangle$$ and $$|\phi\rangle$$ you get $$\langle \psi | \phi \rangle$$, a bra-ket or bracket. I said it was awful.

There's a long story here which I'm not going into. As time passes, the $$\alpha_i$$ evolve according to Schroedinger's (differential) equation, which relates the rate of change of the $$\alpha_i$$ to the surroundings. The details of how this works are, fortunately, irrelevant here.

In addition, there can be more than one way of expressing the state $$|\psi\rangle$$ as a combination of classical states, and the relationship between these expressions is tied to the famous uncertainty principle of Heisenberg. We are doubly fortunate, because we don't need to worry about that here either.

# The cat's state

So, what's actually going on inside the box? We have the radioactive sample, which at a certain time is in a quantum state which means that an observation will show decayed or not decayed with equal probability. Such a microscopic system being in a weird quantum state doesn't raise any immediate alarm bells, it's when the state of a macroscopic object (such as a cat, and whether the cat is alive or dead) is weird in this way that we start to worry.

Looking at this again, it seems reasonable that we can identify the two classical states of the radioactive samples as 'decayed' and 'not decayed', and the relevant aspects of the cat's state as 'dead' (if the sample has decayed) and 'alive' (if the sample has not yet decayed). I'll use $$|a\rangle$$ to represent the state 'alive' and $$|d\rangle$$ to represent 'dead'.

So if the cat is in the state $$|a\rangle$$ then the cat will certainly be alive when it is observed; likewise, if it is in the state $$|d\rangle$$ then it will certainly be dead when observed. So you can see that it is not true that observation of a quantum system unavoidably changes its state. That depends on the state and the type of observation: the uncertainty principle is much more fundamental than the idea that poking a system to see where it is pushes it about a bit. But that's another story.

But in general the cat is not in either of these states. Instead, it is in some state of the form $|\psi\rangle = \alpha |a\rangle + \beta |d\rangle)$ where $$|\alpha|^2+|\beta|^2=1$$. In this state, the probability that the observation reveals a live cat is $$|\alpha|^2$$, and the probability of a dead cat is $$|\beta|^2$$. In the classical thought experiment, we have both of these probabilities equal to $$1/2$$, so one possible state would be $|\psi\rangle = \frac{1}{\sqrt{2}}(|a \rangle + |d \rangle)$

So what is the state of the cat here? It is not alive. That would mean it was in the state $$|a\rangle$$. It is not dead. That would mean it was in the state $$|d\rangle$$. It certainly isn't simultaneously alive and dead, any more than I can be simultaneously in the living room and in the hallway. It also isn't in some kind of intermediate state, such as when I am standing in the doorway with one foot in the living room and one in the hallway. It is in a different kind of state, one without a classical analogue, and this kind of state is called a superposition.

To say that the cat is simultaneously alive and dead is like saying that somebody pointing north-east is simultaneously pointing north and east. They aren't: but the direction they are pointing in has a northern and an eastern component. In the same way, the cat isn't simultaneously alive and dead: but the state it is in has an 'alive' and a 'dead' component.

This actually has enormous consequences. The different components of the state can evolve indepenently, almost as if there are two universes involved, in one of which the cat is alive, and in the other of which it is dead. Developing this idea leads to the many worlds, or Everett-Wheeler model of quantum mechanics, which Sean Caroll describes here.

In any case, if we take quantum mechanics seriously, and the evidence for it it is pretty compelling, then we have to learn to live with and accept (even if we can never be entirely comfortable with) the idea that the state of a system might not be a classically recognized state, but rather a superposition of such states. We even have to accept that the acceptable states may not quite correspond to what we think of as classically acceptable states. They are more general in some respects (we have to allow for superpositions of classical states), but more restricted in others (there is no quantum state for a particle with a simultaneously well-defined position and momentum).

# Decoherence again

All this has been about what the quantum state of the (unobserved) cat in the box is. But maybe you aren't any happier with the cat being in a superposition of states than with the cat somehow being in two states at once. There is a possible way of avoiding this, and it is to argue that when the radioactive sample interacts with the rest of the apparatus in the box (including the cat), that already causes decoherence, so at box opening time what we actually have is a box with either a dead cat or a live cat in it: or again, if we did it many times, a collection of boxes of which about half contain live cats and half contain dead one.

It would be easy to give the impression that decoherence is supposed to be the cure for all our problems with why macroscopic systems look classical, rather than quantum. It would also be wrong: I'm not going to go into any detail at all here, but if you've made it this far you can probably get quite a lot from reading (at least substantial chunks of) Schlosshauer's review article.

## Sunday, July 9, 2017

### The RIGHT way to think about derivatives

At secondary school and then university, we meet a variety of definitions of the derivative, all motivated by the fundamental idea that the derivative is telling us something about the rate of change of some quantity as another varies. The definitions are likely to start with a fairly intuitive one and become more rigorous, incorporating the notion of limit. Unfortunately, as the rigour and degree of precision grow, it is possible for the student to become lost in a thicket of notation, and to lose sight of just what they are studying. However, I claim that if one thinks of the derivative the right way, then one has a fully rigorous concept of derivative which keeps fruitful contact with the intuitive one.

Let's take a look at (some of) these definitions, before settling on the right one.

## The slope of the tangent to a graph

This is a fairly straightforward idea. You start off with the graph of a function, say (just to be conventional) $$y=f(x)$$, and at some chosen value of $$x$$ you draw the tangent to the curve. Then the gradient is how fast the height of the graph changes as $$x$$ changes, so that sounds good.

At least, it sound good until you start thinking about just what you might mean by 'tangent'. If the curve is a circle, that's pretty unambiguous: you take the line perpendicular to the radius at the point of interest, and we all know that is what is meant by the tangent.

Actually, since I am talking about the tangent to a graph, I've just used the upper half of the circle, given by $$y=\sqrt{1-x^2}$$ for $$-1 \leq x \leq 1$$.

But what if the graph isn't (part of) a circle?

We could try 'the tangent to $$y=f(x)$$ at $$(x,f(x))$$ is the line that touches the graph at just the point $$(x,f(x))$$, without crossing it'. That works for circles, but now also includes some other well-known curves, including the parabola and hyperbola.

So we have generalized the notion of tangent usefully, to include some new curves. But what about this?

It looks like what we want, but cuts the graph in more than one place. But somehow that looks all right: the 'bad' intersection is a long way away from the point where we want the tangent. Clearly being a tangent is a local propery, so we can fix this by saying 'the tangent to $$y=f(x)$$ at $$(x,f(x)$$) is line that touches the graph at just one point in the vicinity of $$(x,f(x))$$ without crossing it'.

The straight line certainly looks like a tangent, except that it crosses the graph at the point of tangency. There's no obvious way of retaining the notion of 'touching the graph without crossing it' here.

So eventually we accept that this is going to be hard to fix, and that something a little subtler is required.

## The chord slope definition

Instead of thinking directly about the tangent to a point on the graph, we think of an approximation to it, and see how that can lead to a notion of tangent.

So although we want the tangent at $$(x,f(x))$$ we decide to look instead at the straight line joining two points on the graph: $$(x,f(x))$$ and $$(x+h,f(x+h))$$, where we insist that $$h \neq 0$$.

The gradient is $\frac{f(x+h)-f(x)}{(x+h)-x} = \frac{f(x+h)-f(x)}{h}$ and it seems entirely reasonable (because it is) to say that the slope of the tangent is the limiting value of this as $$h$$ gets closer and closer to $$0$$. So the tangent to $$y=f(x)$$ at $$(x,f(x))$$ is the line through $$(x,f(x))$$ with this gradient. At first exposure, we might not go into too many details of just what we mean by the limiting value, sticking with examples where no obvious complications arise.

This is a meaningful definition. Using it we can calculate the derivative of a variety of simple functions, such as powers of $$x$$ and, with the help of a little geometric ingenuity, the trig functions $$\sin$$ and $$\cos$$.

It's also a sensible definition, because it really does seem to do the job we want to do. It doesn't make sense to say it's 'correct', because this defines the notion. But it does have the properties we'd hope for, and it's something we can calculate with. It's worth noticing that in the process we have shifted from defining the derivative in terms of the tangent to defining the tangent in terms of the derivative!

I claimt that even so, it isn't the right way to think about the derivative.

## The formal limit definition

After spending some time with the chord slope definition, and informally working out some simple limits, it's usually time to give a more rigorous idea of what is going in here.

We then introduce the following incantation: $\lim_{x \to a}f(x) = L$ means $\forall \epsilon>0, \exists \delta>0 \text{ such that } 0 \lt |x-a| \lt \delta \Rightarrow |f(x)-L| \lt \epsilon$ At first sight, and especially for the novice, this is likely to cause a great deal of fear and trembling. It is a complex statement which makes very precise a simple idea. The idea is that we can make $$f(x)$$ as close as we want to $$f(a)$$ by making $$x$$ as close as we have to to $$a$$.

With this in hand, we can tighten up some of what is done with the chord slope idea by writing $f'(x)= \lim_{h \to 0} \frac{f(x+h)-f(x)}{h}$ It isn't really anything new, it's just a more precise way of stating the previous idea.

There's also the minor variation that $$f$$ has derivative $$f'(x)$$ at $$x$$ if $\lim_{h \to 0} \left| \frac{f(x+h)-f(x)}{h} - f'(x) \right| = 0$

Working with these, including showing that they are equivalent, gives some practice in understanding the formal statement of what is meant by a limit, and how that formal definition is used. But I also claim that this is not the right way to think about the derivative.

## Best Linear Approximation

Once the derivative has been defined and calculated in some tractable cases, various theorems about derivatives are presented. Here's one that gets used a lot. $f(x+h) \approx f(x)+hf'(x)$ or rather, with some foresight, $f(x+h)-f(x) \approx f'(x)h$ Now, as it stands, this says less than you might think. As long as the function $$f$$ is continuous at $$x$$, then for very small $$h$$, $$f(x)$$ and $$f(x+h)$$ are very close together, and $$f'(x)h$$ is also very small, so the the approximation gets better and better as $$h$$ approaches $$0$$, no matter what value we use for $$f'(x)$$. But this is just saying that both sides are approaching $$0$$, and this obviously isn't all we mean by that approximate formula.

Let's tease it out by looking a little more carefully. Instead of having an approximate equality, we will have an exact equality which includes an error term. $f(x+h) - f(x) = f'(x)h+e(h)$ where $$e(h)$$ is the error, which in general will depend on the size of $$h$$. Now, it turns out that if $$f$$ is differentiable at $$x$$, with derivative $$f'(x)$$, then as $$h$$ gets closer and closer to $$0$$, not only does $$e(h)$$ get arbitrarily close to $$0$$, but $$e(h)/h$$ also gets arbitrarily small. When this happens I say that $$e(h)$$ is suitably small.

In other words, for very small $$h$$, the error is a very small fraction of $$h$$, and we can make that fractional error as small as we want by choosing $$h$$ small enough.

It's this behaviour of the error term that makes it all work. If we choose any other value, say $$K$$ instead of $$f'(x)$$ and try to use that instead, we have $f(x+h) - f(x) = Kh + E(h)$ where $$E(h)=(f'(x)-K)h +e(h)$$. But for very small $$h$$, we see that $\frac{E(h)}{h}=f'(x)-K + \frac{e(h)}{h}$ and we cannot make this small by making $$h$$ small. So the special property of $$f'(x)$$ is that $$f'(x)h$$ is the best linear approximation to $$f(x+h)-f(x)$$.

To make contact with the usual limit definition, we note that there is a best linear approximation if and only if the error term is suitably small.

I claim that this is the best way to think about the derivative.

# What's so great about it?

The first thing to say is, it doesn't make it any easier to calculate a derivative. That's still just the same as it always was.

What's so great about it is that it gives insight into how derivatives behave. This is not to say that it makes the proofs of behaviour easier: but it does make the results easier to see and understand.

## Differentiation rules

We learn the various rules for differentiating combinations of functions in our first calculus course: linearity, the product, quotient and chain rules. Let's see how this point of view does with them.

In each case, we'll think about two functions $$f$$ and $$g$$, and use $$e_f$$ and $$e_g$$ for the corresponding errors.

### Linearity

Suppose we have two real numbers $$\alpha$$ and $$\beta$$. Then $\begin{split} (\alpha f+\beta g)(x+h)-(\alpha f+\beta g)(x) &= \alpha f(x+h)+ \beta g(x+h) - \alpha f(x)-\beta g(x)\\ &=\alpha(f(x+h)-f(x))+\beta(g(x+h)-g(x))\\ &= \alpha (f'(x)h+e_f(h))+\beta(g'(x)h+e_g(h)\\ &=(\alpha f'(x)+\beta g'(x))h +\alpha e_f(h)+\beta e_g(h)\\ &=(\alpha f'(x)+\beta g'(x))h + e(h) \end{split}$ where $$e(h)$$ is obviously suitable small.

So differentiation is linear.

### Product rule

$\begin{split} (fg)(x+h)-(fg)(h) &= f(x+h)g(x+h)-f(x)g(x)\\ &= (f(x)+f'(x)h+e_f(x))g(x+g'(x)h+e_g(x))-f(x)g(x)\\ &= (f'(x)g(x)+f(x)g'(x))h +f'(x)he_g(x)+g'(x)he_f(x)\\ &= (f'(x)g(x)+f(x)g'(x))h + e(h) \end{split}$ where $$e(h)$$ is still fairly obviously suitably small.

### Quotient rule

Actually, I'll cheat slightly. Assuming that $$f(x) \neq 0$$, we have $\begin{split} \frac{1}{f(x+h)} - \frac{1}{f(x)} &= \frac{1}{f(x)+f'(x)h+e_f(h}-\frac{1}{f(x)}\\ &=\frac{-f'(x)h-e_f(h)}{f(x)(f(x)+f'(x)h+e_f(h))}\\ &= -\frac{f'(x)}{f(x)(f(x)+f'(x)h+e_f(h))} - \frac{e_f(h)}{f(x)(f(x)+f'(x)h+e_f(h))} \end{split}$ which is less obviously, but quite plausibly $-\frac{f'(x)}{f(x)^2}h + e(h)$ where $$e(h)$$ is suitably small.

Then the quotient rule is just the product rule combined with this one.

### Chain rule

And finally, we have $\begin{split} f(g(x+h))-f(g(x)) &= f(g(x)+g'(x)h+e_g(h))-f(g(x))\\ &= f(g(x))+f'(g(x))(g'(x)h+e_g(h)) -f(g(x))\\ &= f'(g(x))g'(x)h+e(h)) \end{split}$ where again it is not immediately obvious, but is quite plausible that $$e(h)$$ is suitably small.

### Insight, not proof

So we can see that in each case, assuming that the behaviour of error terms is not unreasonable, this idea of the derivative as the best linear approximation to the change in the function value gives us insight into why the various rules for differentiation work as they do.

## But there's more

This works very well. The mental picture of a best linear approximation is much more meaningful than that of some limiting value, and the arguments involving it are rather more intuitive than those involving the usual limit definition.

But, as is often the way with a good idea, we get more than we seem to have paid for.

### Functions of several variables

Suppose that now instead of having a function $$f:\mathbb{R} \to \mathbb{R}$$ we have $$f:\mathbb{R}^n \to \mathbb {R}$$. The picture is much the same except that $$x$$ and $$h$$ are now vectors with $$n$$ components, and I will adopt the standard convention that they are represented as column vectors. So we can still say that $$f$$ is differentiable at $$x$$ with derivative $$f'(x)$$ if this $$f'(x)$$ has the property that $f(x+h)-f(x)=f'(x)h +e(h)$ where $$e(h)$$ is suitably small.

But now we can think about what sort of thing this $$f'(x)$$ actually is. $$h$$ is a vector with $$n$$ components, and so $$f'(x)$$ must be a linear mapping from the space of these vectors to $$\mathbb{R}$$, in other words it must be a row vector, or covector, with $$n$$ components.

This is where we first see why I wrote $$f'(x)h$$ rather than $$hf'(x)$$ previously. When $$f$$ was a real valued function of a real variable, it made no difference, but now the order is significant, and is determined by the requirement that we end up with a scalar quantity.

The next obvious question is, what are these components of $$f'(x)$$? We can answer this by choosing $$h$$ carefully. To pick out components of vectors, I'll use a superscript for a component of a column vector and a subscript for a component of a row vector. Then we have $f(x+h)-f(x) = f'(x)h +e(h) = f'(x)_i h^i +e(h)$ which tells us that $f'(x)_i = \frac{\partial f}{\partial x^i}(x)$

So our definition of 'derivative' naturally produces the object we learn to call the gradient of $$f$$ in vector calculus (and maybe gives a little more explanation for the name).

In fact, it tells us more. It tells us that partial derivatives are really components of the best linear approximation, i.e. the actual derivative of the function. The partial derivatives are not best thought of as objects which we define as ordinary derivatives 'holding all but one variable constant' but as components of a (row) vector naturally associated with the function.

Again, thinking of the derivative as the best linear approximation gives us a deeper insight into a chunk of familiar calculus. As a bonus, it also gives us our first glimpse of the fact that the derivative of a function of this type is not a vector, but is a covector. This is a distinction that grows in importance as one generalizes from functions on $$\mathbb{R}^n$$ to functions on a surface in $$\mathbb{R}^n$$, and finally to functions on a manifold which may or may not be equipped with a (semi-)Riemannian metric. It is a gift which keeps on giving.

### Tangent (velocity) vector

In the last example, we looked at functions which associate a value to each point of space. This time, we'll do the opposite, and think about functions which associate a point in space to each value of some parameter which we can think of as time. So we have functions of the form $$f:\mathbb{R} \to \mathbb{R}^n$$, and use $$t$$ for the argument of the function.

We can now play the same game and see where it takes us. $$f$$ is differentiable with derivative $$f'(t)$$ if $f(t+h)-f(t) = f'(t) h + e(h)$ where $$e(h$$ is suitably small.

This time, we see that $$h$$ is just a number, and so $$f'(t)$$ really is a vector with $$n$$ components. And by looking at the components of $$f(t)$$, we see that $f'(t)^i = \frac{d f^i}{dt}(t)$ and so the derivative of $$f$$ really is the velocity of a point whose position is given by as a function of $$t$$ by $$f(t)$$.

It's useful to note that the character of $$f'$$, i.e. whether it is a scalar quantity, or a vector, or a covector, is entirely determined by the dimensions of the domain and codomain of $$f$$. We don't choose whether to make a vector or a covector out of the derivatives of $$f$$; that is decided by the fact that $$f'$$ is the best linear approximation.

### The whole hog

Let's combine the two ideas up above, and consider $$f:\mathbb{R}^n \to \mathbb{R}^m$$, so now $$f(x)$$ is a vector with $$m$$ components and $$x$$ is a vector with $$n$$ components. As always, $$f$$ is differentiable with derivative $$f'(x)$$ if $f(x+h)-f(x) = f'(x)h + e(h)$ where $$e(h)$$ is suitably small.

Now, $$f'(x)$$ is a linear transformation which takes a vector with $$n$$ components and produces a vector with $$m$$ components. In other words, it is an $$m \times n$$ matrix. Using my previous notational device of labelling rows with superscripts and columns with subscripts, we now have $f'(x)^i_j = \frac{\partial f^i}{\partial x^j}$ and the hidden agenda behind the placement of the indices now becomes slightly exposed. (But trust me, there is plenty still lurking behind the tapestry.)

So the power of thinking of the derivative as the best linear approximation should now be rather clearer: we can't always use the chord-slope definition, though we can generally pick an aspect of the set-up where it can be used to define, say, partial derivatives. But the best linear approximation makes sense in great generality, and the familiar objects built out of partial or component derivatives can now be seen as the natural objects, and the partial or component derivatives as analogous to the components of vectors.

### Combinations

In general, we can't multiply, divide, or compose functions. But whenever we can, the rules for differentiating combinations work in just the same way. In fact, if you revisit the (plausibility) argument for each rule, you can see that all that matters is that the algebraic operation makes sense. Although they were written with functions of the form $$\mathbb{R} \to \mathbb{R}$$ in mind, in fact that is irrelevant.

Probably the most powerful aspect of this comes from the chain rule. It tells us that the derivative of the composition is the product (in general, a matrix product) of the derivatives of the functions. A beautiful consequenc is that if $$f,g:\mathbb{R}^n \to \mathbb{R}^n$$ are inverses of each other, so are the derivatives. So the partial derivatives of $$f$$ are not the inverses of the partial derivatives of $$g$$ (a standard error of students in advanced calculus courses), but the matrices built out of the partial derivatives are in inverse relation to one another.

## Algebra for and from calculus

So we are just beginning to see that what we know about linear algebra (vector spaces, dual vector spaces, linear transformations) is relevant to calculus. Since the derivative is the best linear approximation, and derivatives of combinations of functions are linear algebraic combinations of the derivatives, we can use the apparatus of linear algebra to tell us about the (local) behaviour of non-linear functions.

In some ways, this is the entire point of differential calculus: it is a tool for turning the difficult analysis of nonlinear functions into the relatively easy analysis of linear ones. The identification of the derivative as the best linear transformation is what makes this all work.

This all extends to even more general situations. The spaces involved don't actually have to be finite dimensional, as long as there is a way of talking about the size of the vectors.

It leads on to, for example, ways of considering fluid flow, where the velocity field of the fluid can be thought of as the tangent velocity to a curve in an infinite dimensional space of fluid configurations, and of thinking about the Euler-Lagrange equations in the calculus of variations as the derivative of a functional, to name just two.

## Simply the best?

This gives some of the reasons I think that the right way to think of the derivative is as the best linear approximation. It both gives insight into what the rules for derivatives of combinations of functions are actually saying, and makes the various generalizations easier to understand. It lets us see how we can use linear algebra to understand aspects of the behaviour of nonlinear functions. It generalizes still further, in ways that I have just hinted at.

But it isn't the right way for everybody. I don't think it's the right way for it to be introduced to novices, or indeed for novices to try to think about it. It really makes sense best when you already have a bunch of different-looking bits of differentiation at hand, which seem to be myusteriously linked together, when it provides a unifying framework to remove the mystery. Then it lets things drop into place together, and you can see how they're all really different aspects of the same thing. But you do have to go through a certain amount of getting to grips with the more traditional approaches in order to have a mental place for this to live.