For a while now I have had an interest in information geometry. The maxims that geometry is intuitive maths and information theory is intuitive statistics seem pretty fair to me, so it’s quite surprising to find a lack of easy-to-understand introductions to information geometry. This is my first attempt; the idea is to get a geometric understanding of the mutual information and to introduce a few select concepts from information geometry.
Mutual information
Most people who have heard of it will know that the mutual information is a measure of how correlated two variables are. It is usually written as:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},$$

but it can also be written as a kind-of-distance-measure (a Kullback-Leibler divergence) between the joint distribution $p(x,y)$ and a distribution where the probabilities of each $x$ or $y$ are the same as in $p(x,y)$ but the variables $X$ and $Y$ are independent; in other words, the product of the marginal distributions $p(x)\,p(y)$. The distance measure, the Kullback-Leibler divergence between distributions $P$ and $Q$, is usually written $D_{KL}(P \,\|\, Q)$ and is equal to (for discrete variables)

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)},$$

which allows us to write the mutual information as

$$I(X;Y) = D_{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big),$$

with the sum in $D_{KL}$ being over both $x$ and $y$.
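Here is a quick numerical sketch of that identity, just to fix ideas (the joint distribution below is a made-up example):

```python
import numpy as np

# A made-up 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1)        # marginal p(x)
p_y = p_xy.sum(axis=0)        # marginal p(y)
q_xy = np.outer(p_x, p_y)     # product of marginals p(x)p(y)

# Mutual information written directly ...
mi_direct = np.sum(p_xy * np.log(p_xy / q_xy))

# ... and as the Kullback-Leibler divergence D_KL(p(x,y) || p(x)p(y)).
def kl(p, q):
    return np.sum(p * np.log(p / q))

mi_as_kl = kl(p_xy, q_xy)

print(mi_direct, mi_as_kl)    # the two numbers agree
```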
For the example here I will look at the mutual information between two binary variables: $X$, which can be either $0$ or $1$, and $Y$, which can be either $0$ or $1$. In this case there are only four different probabilities to consider; as these must sum to one, there are only three degrees of freedom and we can visualize the probability space in three dimensions.
The standard geometry bit
To begin with it is useful to introduce two coordinate systems. Firstly, one that is a ‘natural coordinate system’ in a sense that I will explain later (the $i$ in $p^i$ is a superscript index, not a power):

$$p^1 = p(X{=}0,Y{=}0), \quad p^2 = p(X{=}0,Y{=}1), \quad p^3 = p(X{=}1,Y{=}0), \quad p^4 = p(X{=}1,Y{=}1);$$

in this coordinate system $0 \le p^i \le 1$ and $p^1 + p^2 + p^3 + p^4 = 1$. The other coordinate system that I will use is one for visualizing things in Cartesian coordinates:

$$x = p(X{=}1) = p^3 + p^4, \qquad y = p(Y{=}1) = p^2 + p^4, \qquad z = p(X{=}Y) = p^1 + p^4.$$

Here $x$, $y$ and $z$ correspond to probabilities (as do the $p^i$s), so every probability must be inside the cube with edges running from $0$ to $1$, but the space of probabilities is more constrained than this and in fact lies within a tetrahedron made out of alternate corners of this cube.
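To make the tetrahedron claim concrete, here is a small sketch using the coordinate definitions above: the four deterministic distributions land on alternate corners of the unit cube.

```python
import numpy as np

def to_cartesian(p):
    """Map (p1, p2, p3, p4) = (p00, p01, p10, p11) to (x, y, z)."""
    p1, p2, p3, p4 = p
    x = p3 + p4          # p(X=1)
    y = p2 + p4          # p(Y=1)
    z = p1 + p4          # p(X=Y)
    return np.array([x, y, z])

# The four deterministic joint distributions ...
corners = {
    "(X,Y)=(0,0)": [1, 0, 0, 0],
    "(X,Y)=(0,1)": [0, 1, 0, 0],
    "(X,Y)=(1,0)": [0, 0, 1, 0],
    "(X,Y)=(1,1)": [0, 0, 0, 1],
}

# ... land on alternate corners of the cube: (0,0,1), (0,1,0), (1,0,0), (1,1,1).
for name, p in corners.items():
    print(name, to_cartesian(p))
```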
The cases where the variables $X$ and $Y$ are independent form a surface in the tetrahedron. We can get an equation for it from $p(x,y) = p(x)\,p(y)$; in the Cartesian coordinates we have:

$$z = xy + (1-x)(1-y).$$

For the natural coordinates we have:

$$p^1 p^4 = p^2 p^3.$$
The surface looks like this:
There is also a set of lines formed by all the distributions that have the same marginal distributions ($p(x)$ and $p(y)$). These are lines that are parallel with the $z$-axis. Putting all this together we get a picture that looks like this:

Each line intersects with the surface exactly once, which we can see when we look at it along the $z$-axis. You can see that there is an intersection for every possible value of $x$ (that is, of $p(X{=}1)$) and $y$ (of $p(Y{=}1)$).
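Here is a small numerical sketch of that claim (the particular marginals are an arbitrary choice): for fixed marginals, the product distribution is the one point on the constant-marginal line that also lies on the independence surface.

```python
import numpy as np

x, y = 0.7, 0.4                      # arbitrary marginals p(X=1), p(Y=1)

# The independent (product) distribution with these marginals,
# in natural coordinates (p1, p2, p3, p4) = (p00, p01, p10, p11).
p = np.array([(1 - x) * (1 - y), (1 - x) * y, x * (1 - y), x * y])

# It sits on the constant-marginal line ...
print(p[2] + p[3], p[1] + p[3])      # recovers x and y

# ... and on the independence surface, in both coordinate systems.
print(np.isclose(p[0] * p[3], p[1] * p[2]))                  # p1*p4 = p2*p3
print(np.isclose(p[0] + p[3], x * y + (1 - x) * (1 - y)))    # z = xy + (1-x)(1-y)
```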
Arc lengths (Euclidean distances between points)
The mutual information, as we will see, can be thought of as measuring a distance to the curved surface along one of the lines in the picture above. This can be done in standard geometry, although it does not yield a standard information measure. The basic idea is that to measure the length of a curve we add up lots of little bits of the curve.
The picture shows the case of a 2D Euclidean geometry with a curve between two points $A$ and $B$ in a plane with coordinates $x$ and $y$. We can work out the length of the curve by integrating the line element $\mathrm{d}s$ along the curve – $s = \int \mathrm{d}s$. Another way of writing this is roughly:

$$s = \int \sqrt{\mathrm{d}x^2 + \mathrm{d}y^2}.$$

The standard thing to do is to write the coordinates of the curve as a function of some arbitrary measure ($\lambda$) of how far along the curve we have gone; usually we make it so that the parameter is $0$ at the start and $1$ at the end:

$$x = x(\lambda), \qquad y = y(\lambda), \qquad \lambda \in [0,1].$$

Then we can write the integral as

$$s = \int_0^1 \sqrt{\left(\frac{\mathrm{d}x}{\mathrm{d}\lambda}\right)^2 + \left(\frac{\mathrm{d}y}{\mathrm{d}\lambda}\right)^2}\;\mathrm{d}\lambda.$$
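As a quick sketch of this formula in action, the following adds up little bits of a quarter of a unit circle (an arbitrary choice of curve); the answer should be close to $\pi/2$.

```python
import numpy as np

# Parameterize a quarter of the unit circle by lambda in [0, 1].
lam = np.linspace(0.0, 1.0, 10001)
x = np.cos(0.5 * np.pi * lam)
y = np.sin(0.5 * np.pi * lam)

# "Add up lots of little bits of the curve":
# each little bit has length sqrt(dx^2 + dy^2).
arc_length = np.sum(np.sqrt(np.diff(x)**2 + np.diff(y)**2))

print(arc_length, np.pi / 2)   # ~1.5708 in both cases
```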
To make this look a bit more like differential geometry we can write the bit in the square root in a different way. We now number $x$ and $y$ as dimensions $1$ and $2$ and give them new names $q^1$ and $q^2$ (the raised numbers are just there to say which one is which, i.e. $q^2$ does not mean ‘q squared’, but ‘q number 2’). Expressing the bit in the square root in these terms gives:

$$\left(\frac{\mathrm{d}x}{\mathrm{d}\lambda}\right)^2 + \left(\frac{\mathrm{d}y}{\mathrm{d}\lambda}\right)^2 = \sum_{i,j} \delta_{ij}\,\frac{\mathrm{d}q^i}{\mathrm{d}\lambda}\,\frac{\mathrm{d}q^j}{\mathrm{d}\lambda}.$$
I have introduced a new thing here, the $\delta$. This is basically the identity matrix: it takes the value $1$ when $i = j$ and $0$ when $i \neq j$. The benefit of doing this is that we can change the coordinates to something different and keep the distance the same by swapping $\delta_{ij}$ for something else. For example, if we have new coordinates $\tilde{q}^1$ and $\tilde{q}^2$ that are related to $q^1$ and $q^2$ by $q^1 = (\tilde{q}^1)^2$ (read ‘q-tilde one, squared’) and $q^2 = (\tilde{q}^2)^3$ (read ‘q-tilde two, cubed’), then we replace $\mathrm{d}q^i/\mathrm{d}\lambda$ with $\mathrm{d}\tilde{q}^i/\mathrm{d}\lambda$ and $\delta_{ij}$ with a new quantity $g_{ij}$ which is given by (I won’t show the working):

$$g_{ij} = \begin{pmatrix} 4(\tilde{q}^1)^2 & 0 \\ 0 & 9(\tilde{q}^2)^4 \end{pmatrix}.$$
The quantity $g_{ij}$ is known as the metric tensor (or the coordinate-based representation of it). Ultimately, it is the quantity that defines the distance between points in Riemannian geometry. When we change coordinates we have to change the value of $g_{ij}$ so as to keep distances the same. Really, it is the relationship between $g_{ij}$ and the coordinates that defines distances. I’m going to skip how to do the coordinate transformations but you can find a description here – it’s not particularly hard but I’ve waffled about this a bit too much already.
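For the curious, the transformation can be done symbolically; here is a short sketch of the standard rule $g_{ij} = \sum_{k,l}\frac{\partial q^k}{\partial \tilde{q}^i}\frac{\partial q^l}{\partial \tilde{q}^j}\,\delta_{kl}$ applied to the example above (the variable names are my own):

```python
import sympy as sp

# New coordinates (written qt for "q-tilde") and the old ones in terms of them.
qt1, qt2 = sp.symbols("qt1 qt2", positive=True)
q = sp.Matrix([qt1**2, qt2**3])          # q^1 = (qt^1)^2, q^2 = (qt^2)^3

# Jacobian of the old coordinates with respect to the new ones.
J = q.jacobian([qt1, qt2])

# Transform the Euclidean metric delta_ij: g = J^T * delta * J.
delta = sp.eye(2)
g = sp.simplify(J.T * delta * J)

print(g)   # Matrix([[4*qt1**2, 0], [0, 9*qt2**4]])
```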
So, in general, the distance (arc-length) between two points can be written as:

$$s = \int_0^1 \sqrt{\sum_{i,j} g_{ij}\,\frac{\mathrm{d}q^i}{\mathrm{d}\lambda}\,\frac{\mathrm{d}q^j}{\mathrm{d}\lambda}}\;\mathrm{d}\lambda$$

or

$$s = \int \sqrt{\sum_{i,j} g_{ij}\,\mathrm{d}q^i\,\mathrm{d}q^j}.$$

Later, it will be useful if we call the bit in the square root $\mathrm{d}s^2$, so that:

$$\mathrm{d}s^2 = \sum_{i,j} g_{ij}\,\mathrm{d}q^i\,\mathrm{d}q^j.$$
The metric tensor and divergences
When the Kullback-Leibler divergence was first published, Kullback and Leibler noticed that, approximately (for small $\mathrm{d}\theta$), for distributions with parameters $\theta$, written here as $p_\theta(x)$:

$$D_{KL}\big(p_\theta \,\|\, p_{\theta + \mathrm{d}\theta}\big) \approx \frac{1}{2} \sum_{i,j} g_{ij}(\theta)\,\mathrm{d}\theta^i\,\mathrm{d}\theta^j,$$

where $g_{ij}$ is the Fisher information; it is a metric tensor (a valid $g_{ij}$) which can be calculated by:

$$g_{ij}(\theta) = \mathrm{E}\!\left[\frac{\partial \log p_\theta(x)}{\partial \theta^i}\,\frac{\partial \log p_\theta(x)}{\partial \theta^j}\right]$$

or equally (though not obviously)

$$g_{ij}(\theta) = -\,\mathrm{E}\!\left[\frac{\partial^2 \log p_\theta(x)}{\partial \theta^i\,\partial \theta^j}\right].$$

This can be seen from the Taylor expansion, which is a bit of a pain to do, but quite straightforward. Indeed, we can look at higher-order terms too.
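As a quick check that the two expressions really do agree, here they are evaluated symbolically for a Bernoulli distribution with parameter $\theta$ (an arbitrary choice of example); both come out as the familiar $1/(\theta(1-\theta))$.

```python
import sympy as sp

theta, x = sp.symbols("theta x", positive=True)

# Bernoulli log-likelihood: p(x) = theta^x * (1 - theta)^(1 - x), x in {0, 1}.
log_p = x * sp.log(theta) + (1 - x) * sp.log(1 - theta)

def expectation(expr):
    # Expectation over x in {0, 1} under the Bernoulli distribution.
    return sp.simplify(expr.subs(x, 1) * theta + expr.subs(x, 0) * (1 - theta))

# Fisher information, first as the expected squared score ...
g_score = expectation(sp.diff(log_p, theta) ** 2)

# ... and as minus the expected second derivative of the log-likelihood.
g_curv = expectation(-sp.diff(log_p, theta, 2))

print(sp.simplify(g_score), sp.simplify(g_curv))   # both equal 1/(theta*(1 - theta))
```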
It turns out, for the Kullback-Leibler divergence (among others), that if we choose the coordinates correctly then we can recover the divergence by integrating. Basically, we can do this if the coordinates are flat (affine) for the divergence. There are a number of ways of checking this, which can be found in this book; the simplest is checking that:

$$\frac{\partial}{\partial \theta^i}\frac{\partial}{\partial \theta^j}\frac{\partial}{\partial \theta'^k}\, D\big(p_\theta \,\|\, p_{\theta'}\big) = 0$$

or, for the reverse Kullback-Leibler divergence:

$$\frac{\partial}{\partial \theta'^i}\frac{\partial}{\partial \theta'^j}\frac{\partial}{\partial \theta^k}\, D\big(p_\theta \,\|\, p_{\theta'}\big) = 0$$

for all $\theta$ and $\theta'$.
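Here is a small symbolic sketch of a check of this sort, for the simplest case I can think of: a single binary variable with the probability itself used as the coordinate, so $D(a\,\|\,b)$ is the Kullback-Leibler divergence between Bernoulli distributions with parameters $a$ and $b$. The mixed third derivative vanishes identically for the forward divergence in this coordinate, but not for the reverse one.

```python
import sympy as sp

a, b = sp.symbols("a b", positive=True)

# KL divergence between Bernoulli(a) and Bernoulli(b), with the probability
# itself as the coordinate.
D = a * sp.log(a / b) + (1 - a) * sp.log((1 - a) / (1 - b))

# Forward check: differentiate twice in the first argument, once in the second.
forward = sp.simplify(sp.diff(D, a, a, b))

# Reverse check: twice in the second argument, once in the first.
reverse = sp.simplify(sp.diff(D, b, b, a))

print(forward)   # 0 -> the probability coordinate is 'straight' for the forward KL
print(reverse)   # nonzero -> but not for the reverse KL
```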
It’s like a half square arc-length
From the section on classical geometry, we can see that small arc-lengths $\mathrm{d}s$ are given by:

$$\mathrm{d}s^2 = \sum_{i,j} g_{ij}\,\mathrm{d}\theta^i\,\mathrm{d}\theta^j.$$

But the divergences are approximately given by

$$D \approx \frac{1}{2}\sum_{i,j} g_{ij}\,\mathrm{d}\theta^i\,\mathrm{d}\theta^j.$$

So, the divergence is a kind of half-square-arc-length. But the actual squared arc-length is something different; we can see this by direct calculation (sticking, from here on, to a single parameter $\theta$ running from $\theta_0$ to $\theta_1$):

$$s^2 = \left(\int_{\theta_0}^{\theta_1}\sqrt{g(\theta)}\,\mathrm{d}\theta\right)^2 = \int_{\theta_0}^{\theta_1}\!\!\int_{\theta_0}^{\theta_1}\sqrt{g(\theta)\,g(\theta')}\;\mathrm{d}\theta\,\mathrm{d}\theta'.$$

We can visualize this as an integral over a square with sides of length $\theta_1 - \theta_0$ and $\theta_1 - \theta_0$. If we represent the magnitude of $\sqrt{g(\theta)\,g(\theta')}$ as the intensity of a greenness within the square, we can represent the area that we are integrating to get the square arc-length as:

Notice the symmetry in the pattern that is formed: it is symmetric in the axis where $\theta = \theta'$. This makes it possible to write the half-square arc-length as the integral over a triangle:

$$\frac{1}{2}s^2 = \int_{\theta_0}^{\theta_1}\mathrm{d}\theta\int_{\theta_0}^{\theta}\mathrm{d}\theta'\,\sqrt{g(\theta)\,g(\theta')},$$

which is symmetric in the sense that it stays the same if we swap $\theta_0$ and $\theta_1$. We can view this integral in the same way as before.
The divergence, on the other hand, is calculated by (where $\theta_0$ is the value of $\theta$ at the start of the line):

$$D = \int_{\theta_0}^{\theta_1}\mathrm{d}\theta\int_{\theta_0}^{\theta}\mathrm{d}\theta'\; g(\theta'),$$

which is asymmetric: there is a ‘forwards’ and a ‘backwards’ integral. It looks like this:
The asymmetry of the divergence is well known. The breaking of this symmetry corresponds to the fundamental difference between information measures and normal geometric measures.
We are now at a point where we can use an information theoretic integral instead of a geometric one to calculate the distance between points. This method has (nearly) all the features of differential geometry, but the quantities are those of information theory.
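Here is a numerical sketch of the difference, comparing the three triangle-style integrals for a pair of Bernoulli distributions, with the probability itself as the coordinate $\theta$ (so $g(\theta) = 1/(\theta(1-\theta))$, the Fisher information from earlier). The half-squared arc-length is the same whichever end you start from; the two divergence integrals are not, and they reproduce the forward and reverse Kullback-Leibler divergences.

```python
import numpy as np
from scipy import integrate

def g(theta):
    # Fisher information of a Bernoulli distribution in its probability coordinate.
    return 1.0 / (theta * (1.0 - theta))

def triangle(f, a, b):
    # Integral over the triangle: int_a^b dtheta int_a^theta dtheta' f(theta, theta').
    inner = lambda theta: integrate.quad(lambda tp: f(theta, tp), a, theta)[0]
    return integrate.quad(inner, a, b)[0]

t0, t1 = 0.2, 0.7   # two Bernoulli parameters (arbitrary choice)

# Half-squared arc-length: symmetric integrand sqrt(g(theta) * g(theta')).
half_sq_arc = triangle(lambda t, tp: np.sqrt(g(t) * g(tp)), t0, t1)

# Divergence integrals: integrand g(theta') only -> asymmetric.
forward = triangle(lambda t, tp: g(tp), t0, t1)
backward = triangle(lambda t, tp: g(tp), t1, t0)

# Compare with the Kullback-Leibler divergences computed directly.
def kl_bern(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

print(half_sq_arc)                   # unchanged if t0 and t1 are swapped
print(forward, kl_bern(t1, t0))      # these two agree
print(backward, kl_bern(t0, t1))     # and so do these
```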
Back to the Mutual Information
So, let us go back to the description of the mutual information. We know that it is the Kullback-Leibler divergence from the joint distribution to the point with the same $x$ and $y$ coordinates. But we can also say a little more. Lines parallel with the $z$-axis are straight lines in terms of divergences (unlike in normal geometry, this is not the same as the metric being constant). We can see this by checking that

$$\frac{\partial^2}{\partial z^2}\frac{\partial}{\partial z'}\, D_{KL}\big(p_{x,y,z} \,\|\, p_{x,y,z'}\big) = 0$$

for all $z$ and $z'$. This is just the check I mentioned before. It holds because $x$, $y$ and $z$ are linearly related to the probabilities. The result of this is that we can integrate Fisher information in a straight line along the $z$-axis in the Euclidean coordinates and get the Kullback-Leibler divergence.
In other words, for a probability distribution written in terms of $x$, $y$ and $z$, we can write the mutual information as:

$$I(X;Y) = \int_{z_I}^{z} \mathrm{d}u \int_{z_I}^{u} \mathrm{d}u'\; g_{zz}(u'),$$

where $z_I = xy + (1-x)(1-y)$ is the $z$-coordinate of the independent distribution with the same marginals, and where the Fisher information metric along the $z$-direction, $g_{zz}$, is given by:

$$g_{zz} = \frac{1}{4}\left(\frac{1}{p^1} + \frac{1}{p^2} + \frac{1}{p^3} + \frac{1}{p^4}\right),$$

with the $p^i$ evaluated at $(x, y, u')$.
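Putting the whole thing together, here is a numerical sketch for an arbitrary 2×2 joint distribution (using the same coordinate conventions as above): integrating the Fisher information metric along the constant-marginal line, with the asymmetric triangle weighting, from the independent point to the joint distribution, reproduces the mutual information.

```python
import numpy as np
from scipy import integrate

# An arbitrary joint distribution, in natural coordinates (p00, p01, p10, p11).
p = np.array([0.30, 0.10, 0.15, 0.45])

x, y = p[2] + p[3], p[1] + p[3]          # marginals p(X=1), p(Y=1)
z_P = p[0] + p[3]                        # z of the joint distribution
z_I = x * y + (1 - x) * (1 - y)          # z of the independent point

def probs(z):
    # Natural coordinates as a function of z, with the marginals x, y held fixed.
    p4 = (x + y + z - 1) / 2
    return np.array([z - p4, y - p4, x - p4, p4])

def g_zz(z):
    # Fisher information metric along the z-direction: (1/4) * sum_k 1/p_k.
    return 0.25 * np.sum(1.0 / probs(z))

# Asymmetric 'triangle' integral of the metric along the straight line.
inner = lambda z: integrate.quad(g_zz, z_I, z)[0]
mi_geom = integrate.quad(inner, z_I, z_P)[0]

# Mutual information computed directly from the joint distribution.
q = np.array([(1 - x) * (1 - y), (1 - x) * y, x * (1 - y), x * y])
mi_direct = np.sum(p * np.log(p / q))

print(mi_geom, mi_direct)   # the two agree, up to numerical error
```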
The fact that the straight lines for the divergence are straight lines in the Cartesian space is important. In a very real sense, we are only comparing the joint distribution with others that have a different $z$ value (a different $z$ – a different degree of independence – but the same values of $x$ and $y$). Roughly speaking, we could do anything to the geometry of the parts of the space that aren’t on this $z$ line and the divergence would be unaffected.
I think that’s about all I have to say for now. I’m working on a more detailed tutorial for information geometry in general; I’ll post a link when I have finished it.
