For a while now I have had an interest in information geometry. The maxims that geometry is intuitive maths and information theory is intuitive statistics seem pretty fair to me, so it’s quite surprising to find a lack of easy-to-understand introductions to information geometry. This is my first attempt; the idea is to get a geometric understanding of the mutual information and to introduce a few select concepts from information geometry.
Mutual information
Most people who have heard of it will know that the mutual information is a measure of how correlated two variables are. It is usually written as:
$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$
but it can also be written as a kind-of-distance-measure (a Kullback-Leibler divergence) between a joint distribution $p(x,y)$ and a distribution where the probabilities of each $x$ or $y$ are the same as in $p(x,y)$ but the variables X and Y are independent; in other words, to the product of the marginal distributions $p(x)\,p(y)$. The distance measure, the Kullback-Leibler divergence between distributions $p$ and $q$, is usually written $D_{KL}(p \| q)$ and is equal to (for discrete variables)
$$D_{KL}(p \| q) = \sum_i p_i \log \frac{p_i}{q_i}$$
which allows us to write the mutual information as
$$I(X;Y) = D_{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big)$$
with the sum in $D_{KL}$ being over both $x$ and $y$.
For the example here I will look at the mutual information between two binary variables: $X$, which can be either $0$ or $1$, and $Y$, which can be either $0$ or $1$. In this case there are only four different probabilities to consider; as these must sum to one, there are only three degrees of freedom and we can visualize the probability space in three dimensions.
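To make the definition concrete, here is a minimal Python sketch that computes the mutual information of two binary variables as a Kullback-Leibler divergence between the joint table and the product of its marginals. The function names and the two example tables are my own choices for illustration, and the result is in nats (natural logarithm).

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_i p_i log(p_i / q_i), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """Mutual information of a 2x2 joint table indexed as joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    flat_joint = [joint[x][y] for x in range(2) for y in range(2)]
    flat_product = [px[x] * py[y] for x in range(2) for y in range(2)]
    return kl_divergence(flat_joint, flat_product)

# Illustrative tables: one independent, one correlated.
independent = [[0.12, 0.28], [0.18, 0.42]]  # product of (0.4, 0.6) and (0.3, 0.7)
correlated = [[0.4, 0.1], [0.1, 0.4]]

print(mutual_information(independent))  # numerically zero
print(mutual_information(correlated))   # strictly positive
```

The independent table gives a value that is zero up to rounding, while the correlated one gives a positive number, as expected for a divergence from the independent distribution.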
The standard geometry bit
To begin with it is useful to introduce two coordinate systems. Firstly, one that is a ‘natural coordinate system’ in a sense that I will explain later (the raised number on each coordinate is a superscript index, not a power):
[Four equations giving the natural coordinates.]
in this coordinate system [equation] and [equation]. The other coordinate system that I will use is one for visualizing things in Cartesian coordinates:
[Three equations defining the Cartesian coordinates.]
The three Cartesian coordinates correspond to probabilities (as do the natural coordinates), so every probability distribution must be inside the unit cube, where each coordinate lies between $0$ and $1$. But the space of probabilities is more constrained than this, and in fact lies within a tetrahedron made out of alternate corners of this cube.
[Figure: the tetrahedron of valid probability distributions inside the cube.]
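As a rough illustration of the tetrahedron, here is a sketch under an explicit assumption. I take the Cartesian coordinates to be $x = P(X=1)$, $y = P(Y=1)$ and $z = P(X=Y)$. This is one choice that is consistent with the description (alternate cube corners, and lines of constant marginals parallel to one axis), but it is an assumption rather than a definition taken from the text. The helper name `to_cartesian` is also made up for this example.

```python
def to_cartesian(joint):
    """Map a 2x2 joint table joint[x][y] to three coordinates.

    ASSUMPTION (not from the text): x = P(X=1), y = P(Y=1), z = P(X=Y).
    """
    x = joint[1][0] + joint[1][1]
    y = joint[0][1] + joint[1][1]
    z = joint[0][0] + joint[1][1]
    return (x, y, z)

# The four deterministic distributions: all probability on one outcome.
corners = []
for a in range(2):
    for b in range(2):
        table = [[0.0, 0.0], [0.0, 0.0]]
        table[a][b] = 1.0
        corners.append(to_cartesian(table))
print(corners)

# Two tables with the same marginals but different dependence:
# under the assumed coordinates only z changes between them.
joint = [[0.4, 0.1], [0.1, 0.4]]
shifted = [[0.3, 0.2], [0.2, 0.3]]
print(to_cartesian(joint), to_cartesian(shifted))
```

With this choice, any two of the four corners differ in exactly two coordinates, which is what ‘alternate corners of the cube’ means, and every valid distribution is a weighted average of those four corners.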
The cases where the variables $X$ and $Y$ are independent form a surface in the tetrahedron. We can get an equation for it from the independence condition $p(x,y) = p(x)\,p(y)$. In the Cartesian coordinates we have:
[Three equations giving the surface in the Cartesian coordinates.]
For the natural coordinates we have:
[Two equations giving the surface in the natural coordinates.]
The surface looks like this:
[Figures: three views of the independence surface inside the tetrahedron.]
There is also a set of lines formed by all the distributions that have the same marginal distributions ($p(x)$ and $p(y)$). These are lines that are parallel with one of the Cartesian axes. Putting all this together we get a picture that looks like this:
[Figure: the independence surface together with the lines of constant marginals.]
Each line intersects the surface exactly once, which we can see when we look along the axis that the lines are parallel to. You can see that there is an intersection for every possible value of the marginal of $X$ and of the marginal of $Y$, each of which corresponds to one of the two remaining Cartesian coordinates.
[Figure: the lines and the surface viewed along the axis that the lines are parallel to.]
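The claim that each line meets the surface exactly once can be checked directly on the probabilities, without committing to any particular coordinates. The sketch below uses helper names of my own invention. It moves along a line of constant marginals by adding an amount $t$ to the two ‘equal’ cells and subtracting it from the two ‘unequal’ cells. For a 2x2 table, independence is equivalent to $p(0,0)\,p(1,1) - p(0,1)\,p(1,0) = 0$, and that quantity changes by exactly $t$ along the line, so it has a single root.

```python
def marginals(joint):
    """Return the marginals (px, py) of a 2x2 table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def shift_along_line(joint, t):
    """Move along the line of fixed marginals by an amount t."""
    return [[joint[0][0] + t, joint[0][1] - t],
            [joint[1][0] - t, joint[1][1] + t]]

def independence_gap(joint):
    """p(0,0)p(1,1) - p(0,1)p(1,0); zero exactly when X and Y are independent."""
    return joint[0][0] * joint[1][1] - joint[0][1] * joint[1][0]

joint = [[0.4, 0.1], [0.1, 0.4]]   # illustrative table
t_star = -independence_gap(joint)  # the gap is gap(joint) + t, so this is the only root
on_surface = shift_along_line(joint, t_star)

print(on_surface)             # the independent distribution on this line
print(marginals(joint))       # marginals before the shift
print(marginals(on_surface))  # unchanged after the shift
```

The point found this way is the product of the marginals, which is the distribution that the mutual information measures the divergence to.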
Arc lengths (Euclidean distances between points)
The mutual information, as we will see, can be thought of as measuring a distance to the curved surface along one of these lines of constant marginals. This can be done in standard geometry, although it does not yield a standard information measure. The basic idea is that, to measure the length of a curve, we add up lots of little bits of the curve.
[Figure: a curve in a plane divided into short segments.]
The picture shows the case of a 2D Euclidean geometry with a curve between two points in a plane with coordinates $x$ and $y$. We can work out the length of the curve by integrating the small length element $ds$ along the curve, where $ds^2 = dx^2 + dy^2$. Another way of writing this is roughly:
$$s = \int ds = \int \sqrt{dx^2 + dy^2}$$
The standard thing to do is to write the coordinates of the curve as a function of some arbitrary measure ($t$) of how far along the curve we have gone; usually we make it so that the parameter is $0$ at the start and $1$ at the end:
$$x = x(t)$$
$$y = y(t)$$
$$0 \le t \le 1$$
Then we can write the integral as
$$s = \int_0^1 \sqrt{\left(\frac{dx}{dt}\right)^2 + \left(\frac{dy}{dt}\right)^2}\; dt$$
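The ‘add up lots of little bits’ idea is easy to try numerically. This is a minimal sketch, with a function name and test curve of my own choosing, that approximates the length of a parameterised curve by summing short straight segments. The test curve is a quarter of the unit circle, whose length should come out close to $\pi/2$.

```python
import math

def arc_length(x, y, steps=10000):
    """Approximate the length of the curve (x(t), y(t)) for t in [0, 1]
    by summing the lengths of short straight segments."""
    total = 0.0
    prev = (x(0.0), y(0.0))
    for k in range(1, steps + 1):
        t = k / steps
        cur = (x(t), y(t))
        total += math.hypot(cur[0] - prev[0], cur[1] - prev[1])
        prev = cur
    return total

# Quarter of the unit circle, parameterised so t runs from 0 to 1.
quarter_circle = arc_length(lambda t: math.cos(math.pi * t / 2),
                            lambda t: math.sin(math.pi * t / 2))
print(quarter_circle, math.pi / 2)
```

Increasing `steps` makes the segments shorter and the sum closer to the integral.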
To make this look a bit more like differential geometry we can write the bit in the square root in a different way. To do this we number $x$ and $y$ as dimensions $1$ and $2$ and give them new names $q^1$ and $q^2$ (the raised numbers are just there to say which one is which, i.e. $q^2$ does not mean ‘q squared’, but ‘q number 2’). Expressing the bit in the square root in these terms gives:
$$\left(\frac{dq^1}{dt}\right)^2 + \left(\frac{dq^2}{dt}\right)^2$$
$$= \sum_{i,j} \delta_{ij} \frac{dq^i}{dt} \frac{dq^j}{dt}$$
I have introduced a new thing here, the $\delta_{ij}$. This is basically the identity matrix: it takes the value $1$ when $i = j$ and $0$ when $i \neq j$. The benefit of doing this is that we can change the coordinates to something different and keep the distance the same by swapping $\delta_{ij}$ for something else. For example, if we have new coordinates, say $r^1$ and $r^2$, that are related to $q^1$ and $q^2$ by $r^1 = (q^1)^2$ (read ‘q one squared’) and $r^2 = (q^2)^3$ (read ‘q two cubed’), then we replace $q^i$ with $r^i$ and $\delta_{ij}$ with a new quantity $g_{ij}$ which is given by (I won’t show the working):
$$g_{11} = \frac{1}{4\, r^1}$$
$$g_{22} = \frac{1}{9\, (r^2)^{4/3}}$$
$$g_{12} = g_{21} = 0$$
The quantity $g_{ij}$ is known as the metric tensor (or the coordinate based representation of it). Ultimately, it is the quantity that defines the distance between points in Riemannian geometry. When we change coordinates we have to change the value of $g_{ij}$ so that distances stay the same. Really, it is the relationship between $g_{ij}$ and the coordinates that defines distances. I’m going to skip how to do the coordinate transformations, but you can find a description here; it’s not particularly hard but I’ve waffled about this a bit too much already.
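Here is a small numerical check that changing the metric along with the coordinates keeps the distance the same. It rests on assumptions of mine: the new coordinates are called $r^1 = (q^1)^2$ and $r^2 = (q^2)^3$, the matching metric is $g_{11} = 1/(4 r^1)$, $g_{22} = 1/(9 (r^2)^{4/3})$ with zero off-diagonal terms, and the test curve is an arbitrary one that stays in the region where both coordinates are positive. All function names are made up for the example.

```python
import math

def length_with_metric(curve, metric, steps=20000):
    """Integrate sqrt(sum_ij g_ij dq^i dq^j) along curve(t), t in [0, 1],
    evaluating the metric at the midpoint of each short segment."""
    total = 0.0
    prev = curve(0.0)
    for k in range(1, steps + 1):
        cur = curve(k / steps)
        mid = [(a + b) / 2 for a, b in zip(prev, cur)]
        d = [b - a for a, b in zip(prev, cur)]
        g = metric(mid)
        total += math.sqrt(sum(g[i][j] * d[i] * d[j]
                               for i in range(2) for j in range(2)))
        prev = cur
    return total

def q_curve(t):
    """An arbitrary test curve in the original (Euclidean) coordinates."""
    return [1.0 + t, 1.0 + t * t]

def r_curve(t):
    """The same curve in the assumed new coordinates r1 = q1^2, r2 = q2^3."""
    q1, q2 = q_curve(t)
    return [q1 ** 2, q2 ** 3]

def euclidean_metric(q):
    return [[1.0, 0.0], [0.0, 1.0]]

def transformed_metric(r):
    return [[1.0 / (4.0 * r[0]), 0.0],
            [0.0, 1.0 / (9.0 * r[1] ** (4.0 / 3.0))]]

length_q = length_with_metric(q_curve, euclidean_metric)
length_r = length_with_metric(r_curve, transformed_metric)
print(length_q, length_r)
```

The two numbers agree to within the discretisation error, even though the coordinates of the curve are very different in the two systems.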
So, in general, the distance (arc-length) between two points can be written as:
$$s = \int \sqrt{\sum_{i,j} g_{ij}\, dq^i\, dq^j}$$
or
$$s = \int_0^1 \sqrt{\sum_{i,j} g_{ij} \frac{dq^i}{dt} \frac{dq^j}{dt}}\; dt$$
Later, it will be useful if we call the bit in the square root $G(t)$ so that:
$$s = \int_0^1 \sqrt{G(t)}\; dt$$
The metric tensor and divergences
When the Kullback-Leibler divergence was first published, the authors noticed that, approximately (for small $d\theta$), for distributions with parameters $\theta$, written here as $p_\theta$:
$$D_{KL}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2} \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j$$
where $g_{ij}$ is the Fisher information, which is a valid metric tensor and can be calculated by:
$$g_{ij} = \sum_x p_\theta(x)\, \frac{\partial \log p_\theta(x)}{\partial \theta^i}\, \frac{\partial \log p_\theta(x)}{\partial \theta^j}$$
or equally (though not obviously)
$$g_{ij} = -\sum_x p_\theta(x)\, \frac{\partial^2 \log p_\theta(x)}{\partial \theta^i\, \partial \theta^j}$$
This can be seen from the Taylor expansion, which is a bit of a pain to do, but quite straightforward. Indeed, we can look at higher order terms too:
[Equation: the expansion of the divergence including higher order terms.]
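The second-order approximation is easy to check in the simplest case. The sketch below uses a single binary variable with $P(1) = \theta$ (a Bernoulli distribution, chosen here purely as an example), for which the Fisher information is $1/(\theta(1-\theta))$, and compares the exact divergence for a small step with half the Fisher information times the step squared.

```python
import math

def kl_bernoulli(a, b):
    """Kullback-Leibler divergence between Bernoulli(a) and Bernoulli(b), in nats."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information of a Bernoulli distribution with P(1) = theta."""
    return 1.0 / (theta * (1.0 - theta))

theta, d_theta = 0.3, 1e-3   # illustrative values
exact = kl_bernoulli(theta, theta + d_theta)
approx = 0.5 * fisher_bernoulli(theta) * d_theta ** 2
print(exact, approx)
```

The two values agree to within a fraction of a percent, and the small difference that remains comes from the higher order terms.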
It turns out, for the Kullback-Leibler divergence (among others), that if we choose the coordinates $\theta$ correctly, then we can recover the divergence by integrating. Basically, we can do this if
[Equation: the condition on the coordinates.]
There are a number of ways of checking this, which can be found in this book; the simplest is checking that:
[Equation: the check for the Kullback-Leibler divergence.]
or for the reverse Kullback-Leibler divergence:
[Equation: the check for the reverse Kullback-Leibler divergence.]
for all [symbol].
It’s like a half square arc-length
From the section on classical geometry, we can see that small arc-lengths $ds$ are approximately given by:
$$ds^2 \approx \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j$$
But the divergences are approximately given by
$$D_{KL}(p_\theta \,\|\, p_{\theta + d\theta}) \approx \frac{1}{2} \sum_{i,j} g_{ij}\, d\theta^i\, d\theta^j$$
So, the divergence is a kind of half-square-arc-length. But the actual squared arc-length is something different; we can see this by direct calculation, writing $G(t)$ for the quantity under the square root in the arc-length integral:
$$s^2 = \left( \int_0^1 \sqrt{G(t)}\; dt \right)^2 = \int_0^1 \int_0^1 \sqrt{G(t)\, G(t')}\; dt\, dt'$$
We can visualize this as an integral over a square with sides of $t$ and $t'$. If we represent the magnitude of $\sqrt{G(t)\, G(t')}$ as the intensity of greenness within the square, we can represent the area that we are integrating over to get the square arc-length as:
[Figure: a square shaded green according to the integrand.]
Notice the symmetry in the pattern that is formed: it is symmetric about the line where $t = t'$. This makes it possible to write the half-square arc-length as the integral over a triangle:
$$\frac{s^2}{2} = \int_0^1 \int_0^{t} \sqrt{G(t)\, G(t')}\; dt'\, dt$$
which is symmetric in the sense that it stays the same if we swap $t$ and $t'$. We can view this integral in the same way as before.
[Figure: the triangular half of the shaded square.]
The divergence, on the other hand, is calculated by (where $G(t')$ is the value of $G$ at $t'$):
$$D = \int_0^1 \int_0^{t} G(t')\; dt'\, dt$$
which is asymmetric: there is a ‘forwards’ and a ‘backwards’ integral. It looks like
[Figure: the triangle shaded according to the divergence integrand.]
The asymmetry of the divergence is well known. The breaking of this symmetry corresponds to the fundamental difference between information measures and normal geometric measures.
We are now at a point where we can use an information theoretic integral instead of a geometric one to calculate the distance between points. This method has (nearly) all the features of differential geometry, but the quantities are those of information theory.
Back to the Mutual Information
So, let us go back to the description of the mutual information. We know that it is the Kullback-Leibler divergence from the joint distribution to the point on the independence surface with the same two marginal coordinates. But also, we can say a little more. Lines parallel with the Cartesian axis along which the marginals stay fixed are straight lines in terms of divergences (unlike normal geometry, this is not when the metric is constant). We can see this by checking that
[Equation: the check for these lines.]
for all [symbol] and [symbol]. This is just the check I mentioned before. It holds because the three Cartesian coordinates are linearly related to the probabilities. The result of this is that we can integrate the Fisher information in a straight line along that axis in the Euclidean coordinates and get the Kullback-Leibler divergence.
In other words, for a probability written in terms of the Cartesian coordinates, we can write the mutual information as:
[Equation: the mutual information as an integral of the Fisher information metric along the line.]
where the Fisher information metric is given by:
[Equation: the Fisher information metric along the line.]
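The idea of recovering the mutual information by integrating the Fisher information along a straight line can be tried numerically. The sketch below is my own construction rather than a transcription of the formulas above. It parameterises the line by $t$, with $t = 0$ at the product of the marginals and $t = 1$ at the joint distribution, uses $G(t) = \sum (dp/dt)^2 / p(t)$ as the Fisher information of that one-parameter family, and relies on the identity $\int_0^1 \int_0^t G(t')\, dt'\, dt = \int_0^1 (1 - t)\, G(t)\, dt$. It also computes the half-square arc-length for comparison.

```python
import math

def flatten(table):
    return [p for row in table for p in row]

joint = [[0.4, 0.1], [0.1, 0.4]]   # illustrative joint table
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
product = [[px[x] * py[y] for y in range(2)] for x in range(2)]

start, end = flatten(product), flatten(joint)
step = [e - s for s, e in zip(start, end)]   # direction of the straight line

def fisher_along_line(t):
    """Fisher information G(t) of the family p(t) = start + t * step."""
    return sum(d * d / (s + t * d) for s, d in zip(start, step))

def midpoint_integral(f, n=20000):
    """Midpoint-rule integral of f over [0, 1]."""
    return sum(f((k + 0.5) / n) for k in range(n)) / n

mi_direct = sum(p * math.log(p / q) for p, q in zip(end, start))
mi_integral = midpoint_integral(lambda t: (1.0 - t) * fisher_along_line(t))
half_square_arc = 0.5 * midpoint_integral(
    lambda t: math.sqrt(fisher_along_line(t))) ** 2

print(mi_direct, mi_integral)   # these agree
print(half_square_arc)          # this is a different number
```

The weighted integral reproduces the mutual information, while the half-square arc-length along the same line comes out slightly larger, which is the asymmetry between the information measure and the geometric one showing up in numbers.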
The fact that the straight lines for the divergence are straight lines in the Cartesian space is important. In a very real sense, we are only comparing the joint distribution with others that lie on the same line: distributions with different degrees of independence but the same marginals. Roughly, we could do anything to the geometry of the space that isn’t on that line and the divergence would be unaffected.
I think that’s about all I have to say for now. I’m working on a more detailed tutorial for information geometry in general; I’ll post a link when I have finished it.
