Imagine you wake up tomorrow in an empty white room, a la The Matrix. You don’t remember how you got there. Anything can happen.

  • If Keanu Reeves appears, you’d probably think this is related in some way to the Matrix.
    • If Keanu appears and then Laurence Fishburne (Morpheus) appears, you’d think—okay, this is almost definitely related to the Matrix.
  • On the other hand, if Obama and Clinton appear in your white room, in your head, you’d think, okay, the Matrix-related possibilities are less likely; the politics-related possibilities are more likely, whatever those are.

For the first few seconds in that empty white room, without knowing anything, everything is pretty much equally likely to us. In statistics, we call this a uniform distribution. It’s a good starting point when we know nothing. However, once we get new information, we shift probability mass from the less likely events to the more likely events, conditional on what we’ve just learned—in the Neo case, from Obama/Clinton related probabilities to Matrix-related probabilities; in the Obama/Clinton case, from Matrix-related probabilities to political event-related probabilities.
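One way to make this "shifting of probability mass" concrete is Bayes' rule. The sketch below is only an illustration: the two hypotheses, the observation names, and every likelihood number are invented for this example and are not measured from anything.

```python
# Hypothetical numbers throughout: two competing explanations for the white room.
prior = {"matrix": 0.5, "politics": 0.5}  # uniform: we know nothing yet

# Assumed likelihoods P(observation | hypothesis), invented for illustration.
likelihood = {
    "keanu_appears": {"matrix": 0.8, "politics": 0.1},
    "obama_appears": {"matrix": 0.05, "politics": 0.7},
}


def update(belief, observation):
    """Bayes' rule: reweight each hypothesis by its likelihood, then renormalize."""
    unnormalized = {h: belief[h] * likelihood[observation][h] for h in belief}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


after_keanu = update(prior, "keanu_appears")
after_obama = update(prior, "obama_appears")
print(after_keanu)
print(after_obama)
```

With these made-up likelihoods, seeing Keanu moves most of the mass onto the Matrix hypothesis, and seeing Obama moves it the other way. The total always stays at 1; it just gets redistributed.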

Often, in statistics education, we learn distributions in a vacuum of intuition. But, inevitably, we ask ourselves:

  • Why do we use the statistical distributions we use? For example, why is the Normal Distribution everywhere?

We’ll find that statistical distributions aren’t pulled out of thin air. The statistical distributions we’re most familiar with—uniform, exponential, Normal—are exactly determined when we want to maximize our information gain from very simple and very few initial constraints.

We’ll find that we can use our intuition from the Matrix example to help us understand where these statistical distributions come from!

Lean, Mean Information

First, we need some intuition as to what expected information gain means.

Let’s start with the commonplace notion of an “average”. The average, in mathematical terms, is a sum of the value of each event weighted by the probability of that event occurring.

For example, if we have a rigged die with a heavy “six” side, we would expect its rolls to come up higher, on average, than the rolls of a fair die. The higher frequency of occurrence of sixes pulls up the expected value (also known as the average, mean, or mathematical expectation).

Mathematically, what happens is we weigh the value of each event by the probability of that event occurring, and the sum gets us a rough idea of where the next numerical value will land.

\[\text{mathematical expectation} = \text{sum of } p(x) \cdot x \text{ over all } x \\ = \sum_i p(x_i) \cdot x_i\]
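Here is a small sketch of that weighted sum for the two dice. The rigged die's probabilities are an assumption made up for the example (a six half the time, the other faces sharing the rest).

```python
def expected_value(distribution):
    """Sum of each outcome weighted by its probability."""
    return sum(p * x for x, p in distribution.items())


fair_die = {face: 1 / 6 for face in range(1, 7)}

# Hypothetical rigged die: the six comes up half the time,
# the other five faces share the remaining half equally.
rigged_die = {face: 0.1 for face in range(1, 6)}
rigged_die[6] = 0.5

print(expected_value(fair_die))
print(expected_value(rigged_die))
```

The fair die lands at about 3.5 and this particular rigged die at about 4.5, so the extra sixes pull the expectation up by a full point.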

This basic concept of weighing things by the probability of those things occurring is a very useful concept. We can also weigh the information gain of an event occurring by the probability of that event occurring to get an expected information value across all the events we care about. But how do we measure information gain?

Intuitively we know that the more surprising something is, the more information it contains. In other words, the informational value of an event is proportional to all the choices it killed off by virtue of that event occurring.

The information value of an event is related to how much probability mass it moves versus itself once that thing occurs.

Leverage Can Be Surprising

One interesting way to think about this is leverage. Roughly, leverage means how much mass you move versus your own mass. In financial markets, if you outlay $1 million for $5 million of exposure, you’re levered 5 times. For our purposes, we want a good way to formalize our intuitional understanding of information; I haven’t seen information talked about in leverage terms elsewhere and I think it’s an… informative way to look at things.

\[\text{leverage} \propto \frac{\text{exposure controlled}}{\text{initial outlay}}\]

When we talk about “how much probability mass an event moves” or the amount of choices an event kills by virtue of its occurrence, this is in some sense a leverage ratio. What this looks like is the total amount of probability (normalized, we say 1, but it could just as well be some arbitrary sum, like 10000) divided by the probability of that particular event (p). The 10,000 factor cancels out when we divide the total by the individual probability, so we just get

\[\text{info} \propto \frac{1}{p}\]

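Binary is, in a sense, the ultimate form of compression. Boiling things down to the most informative, basic essence of truth or falsity is a beautiful feature of a bit. We can count the number of bits (yes/no answers) needed to single out one option among a number of equally likely alternatives by taking the logarithm, base two, of that number. Our leverage ratio 1/p plays the role of that number, so we get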

\[\text{info} \propto \log_2{\frac{1}{p}}\]
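To get a feel for the numbers, here is a minimal sketch of this quantity in code. The function name `surprisal_bits` is my own label for illustration, not something defined earlier in the post.

```python
import math


def surprisal_bits(p):
    """Information content, in bits, of an event that had probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1 / p)


for p in (1.0, 0.5, 0.25, 1 / 1024):
    print(p, surprisal_bits(p))
```

A certain event (p = 1) carries zero bits, a coin flip carries one bit, and a one-in-1024 event carries ten bits: the rarer the event, the more alternatives it killed off.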

And if we weigh this by the probability of that particular event happening, we get

\[\text{info} \propto p \cdot \log_2{\frac{1}{p}}\]

And if we simplify using the rules of logarithms, we get

\[\text{info} = p \cdot \log_2{\frac{1}{p}} \\ = p \cdot (\log_21 - \log_2p) \\ =-p \cdot \log_2p\]

Awesome! We’ve built the definition of informational entropy from nothing other than a… bit… of intuition. Similar to our understanding for the mathematical expected value of a set of events, we can talk about the mathematical expected information for a set of events.

\[\text{mathematical information expectation} = \sum_i -p(x_i) * \log_2{p(x_i)}\]
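A short sketch of this sum, for discrete distributions given as lists of probabilities. The name `entropy_bits` and the example distributions are assumptions chosen for illustration; terms with p = 0 are skipped, following the usual convention that they contribute nothing.

```python
import math


def entropy_bits(probabilities):
    """Expected information in bits: sum of -p * log2(p), skipping p == 0."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]  # hypothetical bias, just for comparison
fair_die = [1 / 6] * 6

print(entropy_bits(fair_coin))
print(entropy_bits(biased_coin))
print(entropy_bits(fair_die))
```

The fair coin gives one bit per flip, the biased coin gives less (we can mostly guess the outcome in advance), and the fair die gives log2(6) bits.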

Why is this useful? It turns out that the major statistical distributions maximize the expected information gain subject to certain constraints (each major distribution corresponding to different constraints).

Stated in a different way:

Take that our goal is to model the probability distribution for data we’re looking at.

We generally know a few things about the data (these will be our constraints), and we want to pick the probability distribution that maximizes our expected information gain (aka, maximizes our subsequent surprise, or entropy). If we picked a distribution with any less expected information gain than the maximum entropy distribution, we would have inadvertently encoded some information beyond our constraints into our distribution.

So the maximum entropy distribution is the closest thing we can get to a zero-knowledge guess, subject to what we know about the data (our constraints).

Zero Knowledge Maximum Entropy Distribution

We found at the beginning of our journey that the uniform distribution, where we assign each event an equal amount of probability mass, makes intuitive sense as the distribution we should pick when we don’t know anything at all. This isn’t saying that everything in reality has an equal probability of occurring. The point is a bit subtle: it’s just saying that, given what we currently know (assumed to be nothing), no one event is more likely than any other event.

What if we work from the mathematical end? What do we find if we just start out with very few, very basic assumptions and work forward?

\[\text{information, the quantity we want to maximize: } \\ f(x)=-\int_a^b p(x) \cdot \log_2p(x)\,dx \\ \text{unity constraint: }g(x)=\int_a^b p(x)\,dx - 1 = 0\]

In English, we want to maximize the information subject to the unity constraint, and we want to see what p(x) looks like.

Mathematically, we’re going to want to find the local extrema (local minima and maxima) of the information function along the unity constraint. Analogous to minimization and maximization in single-variable calculus, we want to find the points at which the derivative of our information function is zero along the constraint function. Intuitively, this should make sense—we want the extrema, and if the slope of the information function is (for example) greater than zero along the constraint, we would just walk along that direction, increasing our expected information gain along the way, all the while getting closer to a local maximum.

Finding where the derivative of f is zero along g is equivalent to saying the directional derivative of f along a vector s that lies on constraint g is zero.

Because the directional derivative of f along that vector s is zero, we know that the projection of the gradient of f onto s is zero (aka, the dot product of the gradient of f and s is zero).

Therefore, the gradient of f is perpendicular to every such s, which means it is parallel to the normal of the surface of g. The gradient of g is that normal, so the gradient of f is parallel to the gradient of g.

In other words, the gradient of f is some scalar multiple of the gradient of g!

If we find where this occurs, we’ll have found the extrema.
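We can sanity-check the "parallel gradients" idea on a discrete analogue, where the integral becomes a sum over a handful of outcomes. This is a sketch under that assumption; the function names and the two test distributions are made up for the example. In the discrete case the unity constraint sum(p) - 1 has gradient (1, 1, ..., 1), so "parallel to the constraint gradient" just means "all components equal".

```python
import math


def grad_entropy(p):
    """Partial derivatives of -sum(p_i * log2(p_i)) with respect to each p_i."""
    return [-(1 + math.log(pi)) / math.log(2) for pi in p]


def is_multiple_of_ones(vector, tol=1e-12):
    """True when all components are equal, i.e. parallel to (1, 1, ..., 1)."""
    return max(vector) - min(vector) < tol


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.4, 0.3, 0.2, 0.1]

print(grad_entropy(uniform), is_multiple_of_ones(grad_entropy(uniform)))
print(grad_entropy(skewed), is_multiple_of_ones(grad_entropy(skewed)))
```

At the uniform point every partial derivative is the same number, so the entropy gradient is a scalar multiple of the constraint gradient. At the skewed point the components differ, which means we could still slide along the constraint and gain entropy.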

If the above calc-related ideas sound a bit unfamiliar, let me know so I know that there’s demand for me writing something on gradients.

Anyway, mathematically, we’re trying to do this:

\[\nabla f(x) = a \cdot \nabla g(x)\]

Which is equivalent to:

\[\frac{\partial f}{\partial p(x)} = a \cdot \frac{\partial g}{\partial p(x)}\]

Taking the derivative with respect to a function requires a bit of variational calculus, specifically the Euler-Lagrange equation. Thankfully, we have some pretty easy functional derivatives here:

\[\frac{-1-\ln(p(x))}{\ln(2)}=a \cdot 1\]

Let’s simplify! We want to get an expression for p(x):

\[-1-\ln(p(x))=a \cdot \ln(2) \\ 1 + \ln(p(x)) = -a \cdot \ln(2) \\ \ln(p(x)) = -1-a \cdot \ln(2) \\ \implies p(x) = e^{-1-a\ln(2)} \\ p(x) =e^{-1} \cdot e^{-a\ln(2)} \\ p(x) = e^{-1} \cdot 2^{-a}\]

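We’ll plug this expression into our unity constraint. Careful with the notation: the a in the exponent is the multiplier from above, while the a in b - a is the lower limit of integration: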

\[\int_a^b p(x)\,dx=1 \\ \int_a^b e^{-1} \cdot 2^{-a} \,dx = 1 \\ e^{-1} \cdot 2^{-a} \cdot \int_a^b \,dx = 1 \\ e^{-1} \cdot 2^{-a} \cdot (b-a) = 1 \\ e^{-1} \cdot 2^{-a} = \frac{1}{b-a}\]

The left-hand side is exactly our expression for p(x), so

\[p(x) = \frac{1}{b-a}\]

which is the PDF of a continuous uniform distribution!

This is super promising—the probability distribution that maximizes our surprise given we know basically nothing aside from a unity constraint is the uniform probability distribution!

What we’ve just done is confirm mathematically a very solid intuition we explored at the beginning of the piece!
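For a numerical gut check, here is a sketch on the discrete analogue of the problem: among distributions over n outcomes with no constraint other than summing to one, nothing should beat the uniform one. The random sampling scheme, the seed, and n = 8 are arbitrary choices for the illustration, and a random search is of course evidence rather than proof.

```python
import math
import random


def entropy_bits(probabilities):
    """Expected information in bits, skipping zero-probability outcomes."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def random_distribution(n, rng):
    """A random distribution over n outcomes (normalized positive weights)."""
    weights = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


n = 8
rng = random.Random(0)  # fixed seed so the run is repeatable
uniform_entropy = entropy_bits([1 / n] * n)
best_random = max(entropy_bits(random_distribution(n, rng)) for _ in range(10_000))

print(uniform_entropy, best_random)
```

The uniform distribution over eight outcomes has exactly three bits of entropy, and none of the ten thousand random candidates exceeds it.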

Maximum Entropy Distribution Constrained By Expected Value

Very rarely do we know absolutely nothing about the data we have. At the very least, we can describe the data in “coarse” ways. One example of a frequently calculable coarse descriptor is the mean.

If the Uniform Distribution corresponds to zero knowledge, what distribution corresponds to knowledge of only the expected value? Let’s find out!

Again, we’ll have our expected information to maximize and the unity constraint. We’ll add one more constraint representing knowledge of the expected value.

\[\text{information, the quantity we want to maximize: } \\ f(x)=-\int_0^\infty p(x) \cdot \log_2p(x)\,dx \\ \text{unity constraint: }g(x)=\int_0^\infty p(x)\,dx - 1 = 0 \\ \text{expected value constraint: } h(x) = \int_0^\infty x \cdot p(x) \, dx - \mu = 0\]

We’re going to go through roughly the same steps as before. This time, however, because we’re dealing with multiple constraints, we have to extend our understanding of the optimization procedure.

Now, we need to meet not one, but two constraints. This is best understood geometrically by thinking of three dimensions: the intersection of two planes is a line, and that line is orthogonal to the subspace spanned by the normals of the two planes. The same concept applies here, though we’re not dealing here specifically with the intersection of two planes.

Specifically, what we’re looking for is:

  • the extrema, defined as where the directional derivative of f along any vector s that stays on both constraints is zero.
  • So the gradient of f should be perpendicular to every such constraint vector s.
  • Each constraint vector s is orthogonal to the normal of every individual constraint,
  • so s is orthogonal to any linear combination of those normals, that is, to the whole subspace they span.
  • Anything perpendicular to every constraint vector s must lie in that subspace,
  • so we’re looking for where the gradient of f is a linear combination of the normals (the gradients) of the individual constraints.

Whew! That was a lot, but it often helps to reason step by step through things, instead of memorizing the steps for “Lagrange Multipliers with Multiple Constraints”. A post with geometric intuition behind the above is coming (let me know if you want it to come sooner).

We end up with this:

\[\nabla f(x) = a \cdot \nabla g(x) + b \cdot \nabla h(x) \\ \frac{\partial f}{\partial p(x)} = a \cdot \frac{\partial g}{\partial p(x)} + b \cdot \frac{\partial h}{\partial p(x)}\]

Taking the functional derivatives, we get:

\[\frac{-1-\ln p(x)}{\ln2}=a \cdot 1 + b \cdot x\]

Let’s rearrange to see what expression we can uncover for p(x):

\[-1-\ln p(x) = (\ln 2) \cdot (a + b \cdot x) \\ 1 + \ln p(x) = -(\ln 2) \cdot (a + b \cdot x) \\ \ln p(x) = - 1 - (\ln 2) \cdot (a + b \cdot x) \\ \implies p(x) = e^{- 1 - (\ln 2) \cdot (a + b \cdot x)} \\ p(x) = e^{-1} \cdot e^{- (\ln 2) \cdot (a + b \cdot x)} \\ p(x) = e^{-1} \cdot 2^{-(a+b\cdot x)} \\ p(x) = e^{-1} \cdot 2^{-a} \cdot 2^{-bx}\]

Let’s plug p(x) into the unity constraint:

\[\int_0^\infty p(x) \, dx = 1 \implies \int_0^\infty e^{-1} \cdot 2^{-a} \cdot 2^{-bx} \, dx = 1 \\ \implies e^{-1} \cdot 2^{-a} \cdot (b \cdot \ln 2)^{-1} = 1 \\ \implies e^{-1} \cdot 2^{-a}=b \cdot \ln 2\]

Now, let’s simplify our p(x) with what we’ve obtained:

\[p(x) = b \cdot \ln 2 \cdot 2^{-bx}\]

Which will be helpful as we plug it into our second constraint:

\[\int_0^\infty x \cdot p(x) \, dx = \mu \implies \int_0^\infty x \cdot b \cdot \ln 2 \cdot 2^{-bx}\, dx = \mu \\ \implies (b \cdot \ln 2)^{-1} = \mu \\ \implies b \cdot \ln 2 = \mu^{-1} \\ \implies b = (\mu \cdot \ln 2)^{-1}\]
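The two integrals used so far, the normalization integral and the mean integral, both come out to 1/(b ln 2). If you'd rather not take that on faith, here is a rough numerical check. The midpoint-rule helper, the value b = 0.7, and the cutoff of 200 standing in for infinity are all assumptions picked for the check.

```python
import math


def integrate(f, lower, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


b = 0.7  # arbitrary positive multiplier, chosen only for the check
upper = 200.0  # stands in for infinity; 2**(-b*x) is negligible beyond this

# Integral of 2^(-bx) from the unity constraint.
normalizer = integrate(lambda x: 2 ** (-b * x), 0.0, upper)

# Integral of x * p(x) with p(x) = b * ln(2) * 2^(-bx) from the mean constraint.
mean = integrate(lambda x: x * b * math.log(2) * 2 ** (-b * x), 0.0, upper)

expected = 1 / (b * math.log(2))
print(normalizer, mean, expected)
```

All three printed numbers should agree to several decimal places.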

Great! We can use this in refining our expression for p(x):

\[p(x) = b \cdot \ln 2 \cdot 2^{-bx} \implies p(x) = \mu^{-1} \cdot 2^{-bx} \\ \implies p(x) = \mu^{-1} \cdot 2^{-(\mu \cdot \ln 2)^{-1}x} \\ p(x) = \mu^{-1} \cdot e^{-\mu^{-1} \cdot x}\]

Often, we find it useful to rewrite the inverse of the mean as a separate symbol. For example, in exponential cases, if the mean represents the average time interval per event, its inverse represents the average number of events per time interval (the rate), which can be useful in time-related inference.

\[\lambda = \mu^{-1} \\ \implies p(x) = \lambda \cdot e^{- \lambda x}\]

which is the PDF of an exponential distribution!! How cool is that?!
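To see the "maximum" in maximum entropy, we can compare the exponential against a couple of other distributions on non-negative values that share the same mean. This is a sketch, not a proof: the two competitors (a uniform on [0, 2μ] and a gamma with shape 2), the value μ = 2, the cutoff of 100, and the numerical integrator are all choices made for the illustration.

```python
import math


def differential_entropy_bits(pdf, lower, upper, steps=200_000):
    """Midpoint-rule estimate of -integral p(x) * log2 p(x) dx on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lower + (i + 0.5) * width)
        if p > 0:
            total -= p * math.log2(p) * width
    return total


mu = 2.0  # the shared mean; an arbitrary choice for the comparison


def exponential_pdf(x):
    return math.exp(-x / mu) / mu


def uniform_pdf(x):
    # Uniform on [0, 2*mu] also has mean mu.
    return 1 / (2 * mu) if 0 <= x <= 2 * mu else 0.0


def gamma2_pdf(x):
    # Gamma with shape 2 and scale mu/2 also has mean mu.
    scale = mu / 2
    return x * math.exp(-x / scale) / scale**2


upper = 100.0  # stands in for infinity
entropies = {
    "exponential": differential_entropy_bits(exponential_pdf, 0.0, upper),
    "uniform": differential_entropy_bits(uniform_pdf, 0.0, upper),
    "gamma(shape=2)": differential_entropy_bits(gamma2_pdf, 0.0, upper),
}
for name, value in entropies.items():
    print(name, value)
```

The exponential comes out on top, and its estimate matches the closed form (1 + ln μ) / ln 2 bits.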

The Strangest, Most Abnormal Distribution

We’ve stumbled across the uniform and exponential distributions from little more than intuition and some conservative assumptions. The last distribution we’ll talk about appears everywhere, and for seemingly no good reason.

I was extremely confused as to why the Normal (Gaussian) Distribution pops up everywhere—in kurtotically-ignorant financial market analysis, in nature, everywhere. Thinking about it, the prevalence of the Gaussian is actually rather abnormal.

Statisticians are quick to reach for the Central Limit Theorem, but I think there’s a deeper, more intuitive, more powerful reason.

The Normal Distribution is your best guess if you only know the mean and the variance of your data.

It is your minimum-knowledge, maximum-entropy distribution if you know those two, easily-obtained coarse-grained data descriptors. Let’s find out how!

Often, we can measure how much the data deviates from what we expect. The typical size of this deviation (precisely, the square root of the average squared deviation from the mean) is the standard deviation, and its square is the variance. We can add the variance as a constraint to determine the distribution that will maximize our expected information gain.

We’ll take the same information equation from before and add a constraint for variance. Because the variance constraint is already written in terms of the mean μ, and the distribution it produces turns out to be symmetric around μ, we can simplify our constraints a bit and leave out the separate expected value constraint.

\[\text{information, the quantity we want to maximize: } \\ f(x)=-\int_{-\infty}^\infty p(x) \cdot \log_2p(x)\,dx \\ \text{unity constraint: }g(x)=\int_{-\infty}^\infty p(x)\,dx - 1 = 0 \\ \text{variance constraint: } h(x) = \int_{-\infty}^\infty (x-\mu)^2 \cdot p(x) \, dx - \sigma^2 = 0\]

Let’s try to find where the gradient of f is equal to a linear combination of the individual constraint gradients:

\[\nabla f = a \cdot \nabla g +b \cdot \nabla h \\ \frac{\partial f}{\partial p(x)} = a \cdot \frac{\partial g}{\partial p(x)} + b \cdot \frac{\partial h}{\partial p(x)}\]

We’re going to calculate the functional derivatives and see if we can isolate p(x):

\[\frac{-1-\ln p(x)}{\ln 2}=a \cdot 1 + b \cdot (x - \mu)^2 \\ -1-\ln p(x) = (\ln 2) \cdot (a + b \cdot (x - \mu)^2)\\ 1 + \ln p(x) = -(\ln 2) \cdot (a + b \cdot (x - \mu)^2) \\ \ln p(x) = -1 -(\ln 2) \cdot (a + b \cdot (x - \mu)^2) \\ \implies p(x) = e^{-1 -(\ln 2) \cdot (a + b \cdot (x - \mu)^2)} \\ p(x) = e^{-1} \cdot 2^{-(a + b\cdot(x-\mu)^2)} \\ p(x) = e^{-1} \cdot 2^{-a} \cdot 2^{-b \cdot (x-\mu)^2}\]

Let’s plug our p(x) into our unity constraint:

\[\int_{-\infty}^\infty p(x)\,dx=1 \implies \int_{-\infty}^\infty e^{-1} \cdot 2^{-a} \cdot 2^{-b \cdot (x-\mu)^2}\,dx=1 \\ e^{-1} \cdot 2^{-a} \cdot b^{-\frac{1}{2}}\cdot (\frac{\pi}{\ln 2})^{\frac{1}{2}} = 1 \\ e^{-1} \cdot 2^{-a} = b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}}\]

Awesome! Let’s use this to refine our p(x) further:

\[p(x) = e^{-1} \cdot 2^{-a} \cdot 2^{-b \cdot (x-\mu)^2} \implies p(x) = b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2}\]

And let’s plug this new, refined expression into our variance constraint:

\[\int_{-\infty}^\infty(x-\mu)^2 \cdot p(x) \, dx = \sigma^2 \implies \int_{-\infty}^\infty(x-\mu)^2 \cdot b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2} \, dx = \sigma^2 \\ \implies (b \cdot 2\ln2)^{-1} = \sigma^2 \implies (b \cdot \ln 2)^{\frac{1}{2}} =\sigma^{-1} \cdot 2^{-\frac{1}{2}} \\ \text{and } b = \frac{1}{2 \sigma^2 \ln 2}\]
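The two Gaussian-style integrals above are easy to get wrong by a factor of ln 2, so here is a rough numerical check of both. The midpoint-rule helper, the values b = 0.3 and μ = 1.5, and the ±20 window standing in for the whole real line are assumptions picked for the check.

```python
import math


def integrate(f, lower, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(f(lower + (i + 0.5) * width) for i in range(steps))


b, mu = 0.3, 1.5  # arbitrary values chosen only for the check
lower, upper = mu - 20.0, mu + 20.0  # stand-ins for -infinity and +infinity


def kernel(x):
    return 2 ** (-b * (x - mu) ** 2)


# Unity constraint: the integral of 2^(-b (x - mu)^2) should be sqrt(pi / (b ln 2)).
area = integrate(kernel, lower, upper)
expected_area = math.sqrt(math.pi / (b * math.log(2)))


def density(x):
    return kernel(x) / expected_area  # the normalized p(x) from the text


# Variance constraint: the integral of (x - mu)^2 p(x) should be 1 / (2 b ln 2).
variance = integrate(lambda x: (x - mu) ** 2 * density(x), lower, upper)
expected_variance = 1 / (2 * b * math.log(2))

print(area, expected_area)
print(variance, expected_variance)
```

Each printed pair should agree to many decimal places, which is what lets us solve for b in terms of σ.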

If we rewrite our p(x), we can get a better idea of where to plug both of these in:

\[p(x) = b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2} = b^{\frac{1}{2}}\cdot (\ln 2)^{\frac{1}{2}} \cdot \pi^{-\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2} \\ =(b \cdot \ln 2)^{\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot 2^{-b \cdot (x-\mu)^2} \\ = \sigma^{-1} \cdot 2^{-\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot 2^{-b \cdot (x-\mu)^2} \\ = \sigma^{-1} \cdot 2^{-\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot 2^{-\frac{(x-\mu)^2}{2 \sigma^2 \ln 2}}\]

With a change of base, we have:

\[p(x) = \sigma^{-1} \cdot 2^{-\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \\ = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\]

Let’s pretty it up a bit:

\[p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{ \sigma})^2}\]

Wow, this is the PDF of a Normal Distribution! We finally have a reason why the Normal Distribution is everywhere: we’ve proven to ourselves that, starting from nothing more than a mean and a variance, the Normal Distribution is the least-assuming distribution we can choose.

In other words, the Normal Distribution is the maximum entropy distribution for a specified mean and variance. Beautiful!
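As a final numerical gut check, we can line the Normal up against two other zero-mean distributions with the same variance and compare entropies. This is a sketch under stated assumptions: the competitors (a Laplace and a uniform), σ = 1, the ±40 integration window, and the integrator itself are choices made for the illustration.

```python
import math


def differential_entropy_bits(pdf, lower, upper, steps=200_000):
    """Midpoint-rule estimate of -integral p(x) * log2 p(x) dx on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lower + (i + 0.5) * width)
        if p > 0:
            total -= p * math.log2(p) * width
    return total


sigma = 1.0  # shared standard deviation; all three densities have mean 0


def normal_pdf(x):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def laplace_pdf(x):
    scale = sigma / math.sqrt(2)  # Laplace variance is 2 * scale**2
    return math.exp(-abs(x) / scale) / (2 * scale)


def uniform_pdf(x):
    half_width = sigma * math.sqrt(3)  # uniform variance is width**2 / 12
    return 1 / (2 * half_width) if abs(x) <= half_width else 0.0


lower, upper = -40.0, 40.0  # stand-ins for -infinity and +infinity
entropies = {
    "normal": differential_entropy_bits(normal_pdf, lower, upper),
    "laplace": differential_entropy_bits(laplace_pdf, lower, upper),
    "uniform": differential_entropy_bits(uniform_pdf, lower, upper),
}
for name, value in entropies.items():
    print(name, value)
```

The Normal has the highest entropy of the three, and its estimate matches the closed form 0.5 · log2(2πeσ²) bits.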

Wrapping It Up

We’ve learned that the true zero knowledge distribution is a uniform distribution, and that extensions of this concept of a “zero knowledge” distribution (zero knowledge other than your constraints) yield the exponential distribution when only the mean is constrained (on non-negative values) and the Gaussian when the variance is constrained.

It’s difficult to find intuitive step-by-step walkthroughs of maximum entropy distributions and I really enjoy this perspective on statistics, so I wanted to share it with the world in the hopes that it might help people understand probability, statistics, and information a little better.