long intuition

Sharing Secrets Publicly: Diffie-Hellman Key Exchange in Plain English

2021-08-21T05:36:00+00:00

What is a shared secret and why do we need it? It’s a piece of information that you and your friend know, but nobody else knows. You can use this shared secret to encrypt your communication, to ensure it is for your ears and your friend’s ears only.

Mathematically, we can model this secret as some integer. The existence of this integer is not secret (all integers are known to everyone), but knowing which one, among the vast space of integers, decrypts the communication between you and your friend is very hard, and that is what forms the basis of this secret.

The Walls Have Ears

Arriving at the secret without constraints is not hard. Whisper into Bob the Builder’s ear that the integer is 42, and you have your shared secret key right there.

The hard part is if you assume the channel through which you are communicating is not safe.

If the walls have ears, how do you share your secret?

For example: on the internet, we have an insecure protocol (HTTP), yet you want to authenticate with your password and transfer funds securely. How do we make a fundamentally insecure channel secure?

The answer is with a bit of math. Enter Diffie-Hellman key exchange.

Diffie-Hellman, Incorporated

A startup, DH Inc. sells a matte black aluminum cube that promises to help you and your friend communicate securely over insecure channels—for the low, low price of 1 Bitcoin. (And because your friend needs one too, the total comes out to 2 Bitcoin, and DH Inc just raised a Series C from top-tier VCs based on this virality.)

Here’s how they say it will work:

1. You pick one prime number, and one DH number (we’ll come back to this).
2. Over an insecure channel, you send your friend these two numbers.
3. Both you and your friend pick secret numbers (your secret and friend secret), which are not sent over the insecure channel, and enter them into the DH Inc cube.
4. The DH Inc cube will whirr and crank for a few seconds, and steam will come out of the top of the cube, and out pops a number (seemingly always less than the prime number you picked in step 1?).
5. You send over the insecure channel the output of the DH Inc cube to your friend, and receive your friend’s DH Inc output number.
6. You input your friend’s DH Inc output number into your DH Inc cube, and your friend inputs the output you sent into their DH Inc cube.
7. Your cube whirrs again, even more steam hisses out the top, and the cube lights up green. Done! You get another number, and your friend gets another number outputted, and those two numbers are guaranteed to be the same. That “shared secret” can be used to encrypt further communication over the insecure channel. At no point in time did you send over your secret, nor did your friend send over their secret; those are still within the DH Cube. All you sent was a prime number, a special DH number, and the outputs of the cube. Other folks, even if they buy DH cubes, and observe all outputs you send, cannot replicate the shared secret because they don’t have the individual secrets trapped within your DH cube and your friend’s DH cube.
8. You transmit information about DH Inc’s latest funding round to your friend and gossip back and forth.

You, being this generation’s young Steve Wozniak, want to make an open source version of this box. To do that, you have to figure out how it works.

How does it work? RTFM

Luckily, you find an FAQ on DH Inc’s website. It turns out you don’t need the hardware at all!

All the box does is:

Take an input number to the power of your secret (aka, multiply the input number by itself x times where x is your secret):
- \[(\text{input number})^{\text{your secret}} = \text{input number} \cdot \text{input number} \cdot \ldots \cdot \text{input number}\]
Take that result and modulo it by the prime number you picked:
- \[(\text{input number})^{\text{your secret}} \mod (\text{prime number you picked})\]
- A modulus just takes the remainder of a division by that number.
- For example: 7 modulo 5 → 5 goes into 7 once → you have 2 left over → 7 modulo 5 is 2.
- Another example: 19 modulo 7 → 7 goes into 19 twice (7 * 2 is 14) → 19 - (7 * 2) = 19 - 14 = 5 leftover → 19 modulo 7 is 5.
- It’s also called “clock arithmetic”: if it is 9PM where you are and your friend is 6 hours ahead of you, you “wrap around” the clock to get 3AM instead of 15PM.
  - In a remote world, people get very good at clock arithmetic as they have to balance times across time zones :)
- Essentially, if the result of the many multiplications above is extremely large (powers grow quickly, after all), we map that very large number back to the space between 0 and one less than the prime number you picked. We’ll go into why later.
The cube then sleeps 3 seconds, generates steam, hisses and shakes, and outputs the number back to you.
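Minus the steam, the FAQ's description fits in one line of arithmetic. Here's a minimal sketch; the name `dh_cube` is made up for illustration, and Python's built-in three-argument `pow` does the "power, then modulo" step without ever building the enormous intermediate power.

```python
def dh_cube(input_number: int, secret: int, prime: int) -> int:
    """Hypothetical stand-in for the DH Inc cube: (input_number ** secret) % prime."""
    # Three-argument pow reduces modulo `prime` as it goes, so the
    # intermediate values stay small even for very large secrets.
    return pow(input_number, secret, prime)


# The modulus examples from the list above.
print(7 % 5)         # 2
print(19 % 7)        # 5
print((9 + 6) % 12)  # 3 -> 9 o'clock plus 6 hours on a 12-hour clock
```

No 3-second sleep included; add your own if you miss the suspense.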

Unbelievably, you’re doing all the hard work here! You’re finding a prime, the special DH number, and all this box does is some arithmetic and generate steam. We’ll come back to how to pick the prime and DH number later.

The first question in our minds is: how does this box possibly arrive at the same number for two different people after all these steps?

Let’s visualize what’s going on here.

On the first run of the DH cube, we’re taking the special DH number you picked in step 1, multiplying it by itself an arbitrary number of times (specified by your secret), and performing a modulus operation—just taking the remainder with respect to division by the prime (also picked in step 1).

\[(\text{input number})^{\text{your secret}} \mod (\text{prime}) \rightarrow \text{send to friend}\]

In the next step, we don’t send your secret to your friend—that would let anyone listening on the insecure channel know your secret number and then derive the shared secret. Instead, we send the result of this computation to your friend over the insecure channel.

Why won’t this give away the shared secret we derive later? Let’s take a look at the next step to find out.

You take the output of the DH Cube, and send it to your friend.

Then, your friend inputs that number into their DH Cube and gets the shared secret. How is this done?

We figured out that the DH Cube is only doing one thing: multiplying an input by itself x times and taking the remainder with respect to a prime.

Your friend’s cube is doing the same operation, just with their secret.

\[\begin{align*} &(\text{DH Cube output you sent to your friend})^{\text{friend secret}} \mod (\text{prime}) \\ &= \left( (\text{DH number})^{\text{your secret}} \mod (\text{prime}) \right)^{\text{friend secret}} \mod (\text{prime}) \\ &= \text{friend-derived secret} \end{align*}\]

Not only did you send your DH cube output to your friend over the insecure channel, you also received your friend’s DH cube output. What does that look like?

\[\begin{align*} &(\text{DH Cube output you received from your friend})^{\text{your secret}} \mod (\text{prime}) \\ &= \left( (\text{DH number})^{\text{friend secret}} \mod (\text{prime}) \right)^{\text{your secret}} \mod (\text{prime}) \\ &= \text{your derived secret} \end{align*}\]

And the key thing we’re saying—how this all works—is they’re the same.

\[\text{friend-derived secret} = \text{your derived secret} \\ = \text{shared secret}\]

Here’s a short proof for how this works—we basically want to show that it doesn’t matter in which order you exponentiate the DH number and modulo it (whether it’s your secret first or your friend’s secret first), you’ll get the same thing.

As we’ve observed, exponentiation is just multiplying something by itself some number of times. So we’ll prove commutativity of multiplication in modulus space; and if we find multiplication works commutatively in modulus space, we know exponentiation works in modulus space.

Let’s try to prove multiplicative commutativity in modulo space. Here’s our desired result.

\[\big((a \mod n) \cdot (b \mod n) \big) \mod n \\ = (a \cdot b) \mod n \\ \text{if we can prove the above, then by commutativity of normal multiplication,} \\ = (b \cdot a) \mod n\]

Let’s start with the product of a and b modulo n.

\[(a \cdot b) \mod n\]

What is a, and what is b?

Another way of writing a is:

\[a = \text{root}_a + \text{constant}_a \cdot n \implies a \mod n = \text{root}_a \\ b = \text{root}_b + \text{constant}_b \cdot n \implies b \mod n = \text{root}_b\]

We’ll abbreviate root and constant. The following means the same thing as above, but makes the proof a bit cleaner.

\[a = \text{r}_a + \text{k}_a \cdot n \implies a \mod n = \text{r}_a \\ b = \text{r}_b + \text{k}_b \cdot n \implies b \mod n = \text{r}_b\]

We’ll plug the definition of a and b into the product.

\[(a \cdot b) \mod n \\ = \big((\text{r}_a + \text{k}_a \cdot n ) \cdot ( \text{r}_b + \text{k}_b \cdot n)\big) \mod n \\ = (r_ar_b + r_ak_bn+r_bk_an+k_ak_bn^2 ) \mod n\]

We can group the terms:

\[=\big(r_ar_b +n\cdot( r_ak_b+r_bk_a+k_ak_bn) \big) \mod n\]

Because we’re working in mod space n, we know that n * anything is going to be 0 when modulo’d with n. So we can rewrite the above as:

\[= (r_ar_b) \mod n\]

We noted above that

\[\text{r}_a = a \mod n \\ \text{r}_b = b \mod n\]

So let’s substitute:

\[(a \cdot b) \mod n \\ = (r_a \cdot r_b) \mod n \\ = \big( (a \mod n) \cdot (b \mod n) \big) \mod n\]

Which is our desired result!
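If you'd rather see the identity hold on actual numbers than trust the algebra, here's a quick spot check. It's only a sketch: the ranges, the seed, and the helper name `product_mod_matches` are arbitrary choices for illustration, and a spot check is of course not a proof.

```python
import random


def product_mod_matches(a: int, b: int, n: int) -> bool:
    """Check ((a mod n) * (b mod n)) mod n == (a * b) mod n for one triple."""
    return ((a % n) * (b % n)) % n == (a * b) % n


random.seed(0)  # arbitrary seed so the run is repeatable
trials = [
    (random.randrange(10**6), random.randrange(10**6), random.randrange(2, 1000))
    for _ in range(10_000)
]
all_match = all(product_mod_matches(a, b, n) for a, b, n in trials)
print(all_match)  # True
```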

Mapping this back to exponentiation: because exponentiation is iterative multiplication, we can replace the b term with a to see how the above works with powers.

\[\big( (a \mod n) \cdot (a \mod n) \big) \mod n \\ \begin{align*} &= (a \cdot a) &\mod n \\ &= a^2 &\mod n \end{align*}\]

Connecting this to our example above: the shared secret is the same because the order of the exponentiation doesn’t matter. Your friend receives the result that you exponentiated with your secret and exponentiates that result; you receive the result that your friend exponentiated with their secret and exponentiated that. You get the same result, and that’s your shared secret key.

\[\begin{align*} &(\text{DH Cube output you sent to your friend})^{\text{friend secret}} \mod (\text{prime}) \\ &= \left( (\text{DH number})^{\text{your secret}} \mod (\text{prime}) \right)^{\text{friend secret}} \mod (\text{prime}) \\ &= (\text{DH number})^{\text{your secret} \cdot \text{friend secret}} \mod (\text{prime}) \\ \\ &= \text{friend-derived secret} = \text{your derived secret} = \text{shared secret} \\ \\ &= (\text{DH number})^{\text{friend secret} \cdot \text{your secret}} \mod (\text{prime}) \\ &= \left( (\text{DH number})^{\text{friend secret}} \mod (\text{prime}) \right)^{\text{your secret}} \mod (\text{prime}) \\ &= (\text{DH Cube output you received from your friend})^{\text{your secret}} \mod (\text{prime}) \end{align*}\]
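The chain of equalities above can be checked by treating exponentiation literally as repeated multiplication, reducing modulo the prime after every single step. This is a sketch with small made-up numbers (prime 23, DH number 5, secrets 6 and 15) chosen only so the loops stay short; `power_mod_stepwise` is a hypothetical helper, not anything from the DH Inc FAQ.

```python
def power_mod_stepwise(base: int, exponent: int, n: int) -> int:
    """Multiply `base` into the result `exponent` times, reducing mod n each step."""
    result = 1
    for _ in range(exponent):
        result = (result * (base % n)) % n
    return result


prime, dh_number = 23, 5               # small illustrative values (assumptions)
your_secret, friend_secret = 6, 15     # also illustrative

# Your secret first, then your friend's.
friend_derived = power_mod_stepwise(
    power_mod_stepwise(dh_number, your_secret, prime), friend_secret, prime
)
# Your friend's secret first, then yours.
your_derived = power_mod_stepwise(
    power_mod_stepwise(dh_number, friend_secret, prime), your_secret, prime
)
# One big exponent, reduced only once at the very end.
combined = (dh_number ** (your_secret * friend_secret)) % prime

print(friend_derived == your_derived == combined)  # True
```

Reducing at every step or only once at the end gives the same answer, which is exactly what the proof promised.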

There are nuances embedded in this, but this in broad strokes is how you can make a fundamentally insecure channel secure—all without paying a Bitcoin for the DH box.

Below, we’ll investigate some of the nuances.

An example with real numbers

Let’s pick a random prime number (229) and a random DH number for that prime number (90).

We’ll work out the Diffie-Hellman key exchange manually for these two numbers.

Let’s say you pick a secret 673 and your friend picks a secret 404. You do not tell each other your secrets.

Let’s run the open-source version of the DH Cube.

\[(\text{DH number})^\text{your secret} \mod (\text{prime}) \\ \begin{align*} \\&= 90^{673} \mod 229 \\&= 24 \end{align*}\]

You pass this number 24 to your friend. Your friend runs the open-source DH Cube on this number and includes their secret in the computation.

\[(\text{your result})^\text{friend secret} \mod (\text{prime})\\ \begin{align*} &=24^{404} \mod 229 \\ &= 144 \end{align*}\]

Your friend knows the shared secret is 144, but you don’t know that yet because you haven’t received the first result from your friend.

So your friend writes down 144 for safekeeping (taking care to not send that information to you). They clear out your result from the machine, inputting the DH number instead.

\[(\text{DH number})^\text{friend secret} \mod (\text{prime}) \\ \begin{align*} \\&= 90^{404} \mod 229 \\&= 91 \end{align*}\]

Your friend sends you this number, 91, and you run the open-source box with your secret, 673.

\[(\text{friend result})^\text{your secret} \mod (\text{prime})\\ \begin{align*} &=91^{673} \mod 229 \\ &= 144 \end{align*}\]

Now you know, thanks to a reverse-engineered DH Inc machine, a shared secret with your friend!
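The whole exchange above fits in a few lines if you want to replay it yourself. This is a sketch that uses the same prime, DH number, and secrets as the worked example; the variable names are mine.

```python
prime, dh_number = 229, 90
your_secret, friend_secret = 673, 404

# First run of each cube: these two outputs travel over the insecure channel.
your_output = pow(dh_number, your_secret, prime)
friend_output = pow(dh_number, friend_secret, prime)

# Second run: each side feeds in the other's output along with its own secret.
friend_derived = pow(your_output, friend_secret, prime)
your_derived = pow(friend_output, your_secret, prime)

print(your_output, friend_output)    # 24 91
print(friend_derived, your_derived)  # 144 144
```

Neither 673 nor 404 ever appears on the wire; only 229, 90, 24, and 91 do.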

What is the special DH Inc number?

We know that this DH Inc number gets multiplied with itself many times before getting modulo’d, and we don’t know exactly how many times beforehand, because exactly how many times is the product of your secret key and your friend’s secret key.

This requires picking a special number—one that, once exponentiated any number of times and modulo’d, will eventually retrieve the set of numbers between 1 and the prime.

Why is this part important? If the exponentiated and modulo’d result only maps to a few values out of the possible set of values, then that means there are much fewer choices for the shared secret. And because the DH number gets sent over the insecure channel, a nosy listener could figure out that the periodicity of the resulting generated values is much lower than what the prime implies, and attempt to decode your subsequent ostensibly secret gossip about the Kardashians with your friend.

The below is an example of a “good” DH number, from the example above (90 with prime 229). This “good” DH number is also called a primitive root, or a generator, because it generates the set of possible residues for that prime number after successive power + modulo operations.

The column order is flipped to make it easier to verify that for every expected generated value (1..228) we have an exponent that maps to that number.

In this case, the number 90 is special because, when exponentiated and modulo’d, we are able to retrieve all numbers from 1 to 228 (the nonzero least residues modulo 229).

If we try the same thing with DH number as 89, for example, we only get 12 unique numbers/residues (1, 18, 89, 94, 95, 107, 122, 134, 135, 140, 211, 228) instead of 228 unique residues. This makes the secret generation much weaker because only 12/228, or ~5%, of the space of possible residues are used for possible secrets.
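You can count the residues a candidate DH number generates with a short loop. This is a brute-force sketch that is only sensible for tiny primes like 229; `generated_residues` is a made-up helper name.

```python
def generated_residues(dh_number: int, prime: int) -> set:
    """Collect dh_number ** k % prime for k = 1 .. prime - 1."""
    residues = set()
    value = 1
    for _ in range(prime - 1):
        value = (value * dh_number) % prime
        residues.add(value)
    return residues


good = generated_residues(90, 229)
weak = generated_residues(89, 229)

print(len(good))     # 228
print(sorted(weak))  # [1, 18, 89, 94, 95, 107, 122, 134, 135, 140, 211, 228]
```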

In practice, the prime space is huge, and the DH numbers are good, so that generating the secret space would take an unimaginably long time.

Trivia

The inventor of public key cryptography, six years prior to Diffie & Hellman’s work, had his work classified for his entire life because it was discovered during his employment by the British intelligence service. James H. Ellis died a month before his work was given public acknowledgement.

A Maximum Entropy Intuition for Fundamental Statistical Distributions

2020-07-20T05:36:00+00:00

Imagine you wake up tomorrow in an empty white room, a la The Matrix. You don’t remember how you got there. Anything can happen.

If Keanu Reeves appears, you’d probably think this is related in some way to the Matrix.
- If Keanu appears and then Laurence Fishburne (Morpheus) appears, you’d think—okay, this is almost definitely related to the Matrix.
On the other hand, if Obama and Clinton appear in your white room, in your head, you’d think, okay, the Matrix-related possibilities are less likely; the politics-related possibilities are more likely, whatever those are.

For the first few seconds in that empty white room, without knowing anything, everything is pretty much equally likely to us. In statistics, we call this a uniform distribution. It’s a good starting point when we know nothing. However, once we get new information, we shift probability mass from the less likely events to the more likely events, conditional on what we’ve just learned—in the Neo case, from Obama/Clinton related probabilities to Matrix-related probabilities; in the Obama/Clinton case, from Matrix-related probabilities to political event-related probabilities.

Often, in statistics education, we learn distributions in a vacuum of intuition. But, inevitably, we ask ourselves:

Why do we use the statistical distributions we use? For example, why is the Normal Distribution everywhere?

We’ll find that statistical distributions aren’t pulled out of thin air. The statistical distributions we’re most familiar with—uniform, exponential, Normal—are exactly determined when we want to maximize our information gain from very simple and very few initial constraints.

We’ll find that we can use our intuition from the Matrix example to help us understand where these statistical distributions come from!

Lean, Mean Information

First, we need some intuition as to what expected information gain means.

Let’s start with the commonplace notion of an “average”. The average, in mathematical terms, is a sum of the value of each event weighted by the probability of that event occurring.

For example, if we roll a rigged die with a heavy “six” side, we would expect the next value to be higher than we would with a fair die. The higher frequency of occurrence of sixes pulls up the expected value (also known as the average, mean, or mathematical expectation).

Mathematically, what happens is we weigh the value of each event by the probability of that event occurring, and the sum gets us a rough idea of where the next numerical value will land.

\[\text{mathematical expectation} = p(x) \cdot x \text{ for all } x \\ = \sum_i p(x_i) * x_i\]
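Here's the die example as a sketch. The rigged die's probabilities are invented for illustration (six comes up half the time, the other faces split the rest evenly), and `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction


def expectation(distribution: dict) -> Fraction:
    """Sum of each value weighted by its probability."""
    return sum(p * x for x, p in distribution.items())


fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Hypothetical rigged die: faces 1-5 get 1/10 each, six gets 1/2.
rigged_die = {face: Fraction(1, 10) for face in range(1, 6)}
rigged_die[6] = Fraction(1, 2)

print(expectation(fair_die))    # 7/2
print(expectation(rigged_die))  # 9/2
```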

This basic concept of weighing things by the probability of those things occurring is a very useful concept. We can also weigh the information gain of an event occurring by the probability of that event occurring to get an expected information value across all the events we care about. But how do we measure information gain?

Intuitively we know that the more surprising something is, the more information it contains. In other words, the informational value of an event is proportional to all the choices it killed off by virtue of that event occurring.

The information value of an event is related to how much probability mass it moves versus itself once that thing occurs.

Leverage Can Be Surprising

One interesting way to think about this is leverage. Roughly, leverage means how much mass you move versus your own mass. In financial markets, if you outlay $1 million for $5 million of exposure, you’re levered 5 times. For our purposes, we want a good way to formalize our intuitional understanding of information; I haven’t seen information talked about in leverage terms elsewhere and I think it’s an… informative way to look at things.

\[\text{leverage} \propto \frac{\text{exposure controlled}}{\text{initial outlay}}\]

When we talk about “how much probability mass an event moves” or the amount of choices an event kills by virtue of its occurrence, this is in some sense a leverage ratio. What this looks like is the total amount of probability (normalized, we say 1, but it could just as well be some arbitrary sum, like 10000) divided by the probability of that particular event (p). The 10,000 factor cancels out when we divide the total by the individual probability, so we just get

\[\text{info} \propto \frac{1}{p}\]

Binary is, in a sense, the ultimate form of compression. Boiling things down to the most informative, basic essence of truth or falsity is a beautiful feature of a bit. We can count the number of bits needed to represent a value by taking its logarithm, base two, so we get

\[\text{info} \propto \log_2{\frac{1}{p}}\]

And if we weigh this by the probability of that particular event happening, we get

\[\text{info} \propto p \cdot \log_2{\frac{1}{p}}\]

And if we use the simplified version, we get

\[\text{info} = p \cdot \log_2{\frac{1}{p}} \\ = p \cdot (\log_21 - \log_2p) \\ =-p \cdot \log_2p\]

Awesome! We’ve built the definition of informational entropy from nothing other than a… bit… of intuition. Similar to our understanding for the mathematical expected value of a set of events, we can talk about the mathematical expected information for a set of events.

\[\text{mathematical information expectation} = \sum_i -p(x_i) * \log_2{p(x_i)}\]
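To make the formula concrete, here's a sketch of both pieces: the information of a single event, and the probability-weighted sum over all events. The function names and the example probabilities are my own choices.

```python
import math


def surprisal_bits(p: float) -> float:
    """Information of one event with probability p: log2(1 / p)."""
    return math.log2(1 / p)


def entropy_bits(probabilities) -> float:
    """Expected information: each event's surprisal weighted by its probability."""
    return sum(p * surprisal_bits(p) for p in probabilities if p > 0)


print(surprisal_bits(0.5))       # 1.0  (a fair coin flip is worth one bit)
print(surprisal_bits(1 / 1024))  # 10.0 (a rarer event carries more information)
print(entropy_bits([0.5, 0.5]))  # 1.0
print(entropy_bits([0.9, 0.1]))  # about 0.469: a lopsided coin surprises us less
```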

Why is this useful? It turns out that the major statistical distributions maximize the expected information gain subject to certain constraints (each major distribution corresponding to different constraints).

Stated in a different way:

Take that our goal is to model the probability distribution for data we’re looking at.

We generally know a few things about the data (these will be our constraints), and we want to pick the probability distribution that maximizes our expected information gain (aka, maximizes our subsequent surprise, or entropy). If we picked a distribution that had any less expected information gain than the maximum entropy distribution, we would have inadvertently encoded some information extra to our constraints into our distribution.

So the maximum entropy distribution is the closest thing we can get to a zero-knowledge guess, subject to what we know about the data (our constraints).

Zero Knowledge Maximum Entropy Distribution

We found at the beginning of our journey that the uniform distribution—where we prescribe to each event an equal amount of probability mass—makes intuitive sense as the distribution we should pick when we don’t know anything at all. This isn’t saying that everything in reality has equal probability of occurring—a bit subtle; it’s just saying that, given what we currently know (assumed to be nothing), no one event is more likely than any other event.

What if we work from the mathematical end? What do we find if we just start out with very few, very basic assumptions and work forward?

\[\text{information, the quantity we want to maximize: } \\ f(x)=-\int_a^b p(x) \cdot \log_2p(x)\,dx \\ \text{unity constraint: }g(x)=\int_a^b p(x)\,dx - 1 = 0\]

In English, we want to maximize the information subject to the unity constraint, and we want to see what p(x) looks like.

Mathematically, we’re going to want to find the local extrema (local minima and maxima) of the information function along the unity constraint. Analogous to minimization and maximization in single-variable calculus, we want to find the points at which the derivative of our information function is zero along the constraint function. Intuitively, this should make sense—we want the extrema, and if the slope of the information function is (for example) greater than zero along the constraint, we would just walk along that direction, increasing our expected information gain along the way, all the while getting closer to a local maximum.

Finding where the derivative of f is zero along g is equivalent to saying the directional derivative of f along a vector s that lies on constraint g is zero.

Because the directional derivative of f along that vector s is zero, we know that the projection of the gradient of f onto s is zero (aka, the dot product of the gradient of f and s is zero).

Therefore, we know that the gradient of f is parallel to the normal of the surface of g, so the gradient of f is parallel to the gradient of g.

In other words, the gradient of f is some scalar multiple of the gradient of g!

If we find where this occurs, we’ll have found the extrema.

If the above calc-related ideas sound a bit unfamiliar, ping me at longintuition@protonmail.com so I know that there’s demand for me writing something on gradients.

Anyway, mathematically, we’re trying to do this (writing the scalar multiple as α so that it doesn’t clash with the interval endpoints a and b):

\[\nabla f(x) = \alpha \cdot \nabla g(x)\]

Which is equivalent to:

\[\frac{\partial f}{\partial p(x)} = \alpha \cdot \frac{\partial g}{\partial p(x)}\]

Taking the derivative with respect to a function requires a bit of variational calculus, specifically the Euler-Lagrange equation. Thankfully, we have some pretty easy functional derivatives here:

\[\frac{-1-\ln(p(x))}{\ln(2)}=\alpha \cdot 1\]

Let’s simplify! We want to get an expression for p(x):

\[-1-\ln(p(x))=\alpha \cdot \ln(2) \\ 1 + \ln(p(x)) = -\alpha \cdot \ln(2) \\ \ln(p(x)) = -1-\alpha \cdot \ln(2) \\ \implies p(x) = e^{-1-\alpha\ln(2)} \\ p(x) =e^{-1} \cdot e^{-\alpha\ln(2)} \\ p(x) = e^{-1} \cdot 2^{-\alpha}\]

We’ll plug this expression into our unity constraint:

\[\int_a^b p(x)\,dx=1 \\ \int_a^b e^{-1} \cdot 2^{-\alpha} \,dx = 1 \\ e^{-1} \cdot 2^{-\alpha} \cdot \int_a^b \,dx = 1 \\ e^{-1} \cdot 2^{-\alpha} \cdot (b-a) = 1 \\ e^{-1} \cdot 2^{-\alpha} = \frac{1}{b-a}\]

This looks like p(x)!

\[p(x)=\frac{1}{b-a}\]

which is the PDF of a continuous uniform distribution!

This is super promising—the probability distribution that maximizes our surprise given we know basically nothing aside from a unity constraint is the uniform probability distribution!

What we’ve just done is confirm mathematically a very solid intuition we explored at the beginning of the piece!
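A numerical sanity check, with one loud caveat: the sketch below uses the discrete analogue (8 equally spaced outcomes instead of a continuous interval), which is my simplification and not the derivation above. It draws a pile of random distributions and confirms that none of them beats the uniform one on entropy.

```python
import math
import random


def entropy_bits(probabilities) -> float:
    """Expected information in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def random_distribution(n: int, rng: random.Random) -> list:
    """Random non-negative weights, normalized so they satisfy the unity constraint."""
    weights = [rng.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


n = 8                   # number of outcomes (arbitrary)
rng = random.Random(0)  # arbitrary seed for repeatability

uniform_entropy = entropy_bits([1 / n] * n)
best_random = max(entropy_bits(random_distribution(n, rng)) for _ in range(10_000))

print(uniform_entropy)                 # 3.0, i.e. log2(8)
print(best_random <= uniform_entropy)  # True
```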

Maximum Entropy Distribution Constrained By Expected Value

Very rarely do we know absolutely nothing about the data we have. At the very least, we can describe the data in “coarse” ways. One example of a frequently calculable coarse descriptor is the mean.

If the Uniform Distribution corresponds to zero knowledge, what distribution corresponds to knowledge of only the expected value? Let’s find out!

Again, we’ll have our expected information to maximize and the unity constraint. We’ll add one more constraint representing knowledge of the expected value.

\[\text{information, the quantity we want to maximize: } \\ f(x)=-\int_0^\infty p(x) \cdot \log_2p(x)\,dx \\ \text{unity constraint: }g(x)=\int_0^\infty p(x)\,dx - 1 = 0 \\ \text{expected value constraint: } h(x) = \int_0^\infty x \cdot p(x) \, dx - \mu = 0\]

We’re going to go through roughly the same steps as before. This time, however, because we’re dealing with multiple constraints, we have to extend our understanding of the optimization procedure.

Now, we need to meet not one, but two constraints. This is best understood geometrically by thinking of three dimensions: the intersection of two planes is a line that is orthogonal to the subspace spanned by the normals of the two planes. The same concept applies here, though we’re not dealing here specifically with the intersection of two planes.

Specifically, what we’re looking for is:

the extrema, defined as where the directional derivative of f along the constraint vector s is zero.
The gradient of f should be perpendicular to the constraint vector s;
so the gradient of f is parallel to the constraint gradient.
The constraint vector is orthogonal to the subspace spanned by the normals of the individual constraints,
so the constraint vector is orthogonal to any linear combination of the normals of the individual constraints.
Because the constraint gradient is orthogonal to the constraint vector, the constraint gradient is parallel to a linear combination of the normals of the individual constraints,
and because we’re looking for where the gradient of f is parallel to the constraint gradient,
we’re looking for where the gradient of f is a linear combination of the normals of the individual constraints.

Whew! That was a lot, but it often helps to reason step by step through things, instead of memorizing the steps for “Lagrange Multipliers with Multiple Constraints”. A post with geometric intuition behind the above is coming (ping me at longintuition@protonmail.com if you want it to come sooner).

We end up with this:

\[\nabla f(x) = a \cdot \nabla g(x) + b \cdot \nabla h(x) \\ \frac{\partial f}{\partial p(x)} = a \cdot \frac{\partial g}{\partial p(x)} + b \cdot \frac{\partial h}{\partial p(x)}\]

Taking the functional derivatives, we get:

\[\frac{-1-\ln p(x)}{\ln2}=a \cdot 1 + b \cdot x\]

Let’s rearrange to see what expression we can uncover for p(x):

\[-1-\ln p(x) = (\ln 2) \cdot (a + b \cdot x) \\ 1 + \ln p(x) = -(\ln 2) \cdot (a + b \cdot x) \\ \ln p(x) = - 1 - (\ln 2) \cdot (a + b \cdot x) \\ \implies p(x) = e^{- 1 - (\ln 2) \cdot (a + b \cdot x)} \\ p(x) = e^{-1} \cdot e^{- (\ln 2) \cdot (a + b \cdot x)} \\ p(x) = e^{-1} \cdot 2^{-(a+b\cdot x)} \\ p(x) = e^{-1} \cdot 2^{-a} \cdot 2^{-bx}\]

Let’s plug p(x) into the unity constraint:

\[\int_0^\infty p(x) \, dx = 1 \implies \int_0^\infty e^{-1} \cdot 2^{-a} \cdot 2^{-bx} \, dx = 1 \\ \implies e^{-1} \cdot 2^{-a} \cdot (b \cdot \ln 2)^{-1} = 1 \\ \implies e^{-1} \cdot 2^{-a}=b \cdot \ln 2\]

Now, let’s simplify our p(x) with what we’ve obtained:

\[p(x) = b \cdot \ln 2 \cdot 2^{-bx}\]

Which will be helpful as we plug it into our second constraint:

\[\int_0^\infty x \cdot p(x) \, dx = \mu \implies \int_0^\infty x \cdot b \cdot \ln 2 \cdot 2^{-bx}\, dx = \mu \\ \implies (b \cdot \ln 2)^{-1} = \mu \\ \implies b \cdot \ln 2 = \mu^{-1} \\ \implies b = (\mu \cdot \ln 2)^{-1}\]

Great! We can use this in refining our expression for p(x):

\[p(x) = b \cdot \ln 2 \cdot 2^{-bx} \implies p(x) = \mu^{-1} \cdot 2^{-bx} \\ \implies p(x) = \mu^{-1} \cdot 2^{-(\mu \cdot \ln 2)^{-1}x} \\ p(x) = \mu^{-1} \cdot e^{-\mu^{-1} \cdot x}\]

Often, we find it useful to rewrite the inverse of the mean as a separate symbol. For example, in exponential cases, if the mean represents the average waiting time per event, its inverse represents the average number of events per unit of time (the rate), which can be useful in time-related inference.

\[\lambda = \mu^{-1} \\ \implies p(x) = \lambda \cdot e^{- \lambda x}\]

which is the PDF of an exponential distribution!! How cool is that?!
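We can check the result numerically: the density should integrate to 1, have mean μ, and carry more entropy than some other distribution on the same support with the same mean. This is a rough sketch. The midpoint-rule integrator, the choice μ = 2, the cutoff at 60μ, and the comparison against a uniform distribution on [0, 2μ] are all my own assumptions. Entropy is computed in nats (natural log), which rescales the numbers but doesn't change which distribution wins.

```python
import math


def integrate(f, lo: float, hi: float, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


mu = 2.0         # arbitrary illustrative mean
lam = 1 / mu
upper = 60 * mu  # the tail beyond this point is negligible


def pdf(x: float) -> float:
    return lam * math.exp(-lam * x)


total = integrate(pdf, 0, upper)
mean = integrate(lambda x: x * pdf(x), 0, upper)
entropy_exponential = integrate(lambda x: -pdf(x) * math.log(pdf(x)), 0, upper)

# A uniform distribution on [0, 2 * mu] has the same mean; its entropy is ln(2 * mu).
entropy_uniform = math.log(2 * mu)

print(round(total, 4), round(mean, 4))        # 1.0 2.0
print(entropy_exponential > entropy_uniform)  # True
```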

The Strangest, Most Abnormal Distribution

We’ve stumbled across the uniform and exponential distributions from little more than intuition and some conservative assumptions. The last distribution we’ll talk about appears everywhere, and for seemingly no good reason.

I was extremely confused as to why the Normal (Gaussian) Distribution pops up everywhere—in kurtotically-ignorant financial market analysis, in nature, everywhere. Thinking about it, the prevalence of the Gaussian is actually rather abnormal.

Statisticians are quick to reach for the Central Limit Theorem, but I think there’s a deeper, more intuitive, more powerful reason.

The Normal Distribution is your best guess if you only know the mean and the variance of your data.

It is your minimum-knowledge, maximum-entropy distribution if you know those two, easily-obtained coarse-grained data descriptors. Let’s find out how!

Often, we can measure how much the data deviates from what we expect. This “expected deviation” we call the standard deviation, and we can also add this as a constraint to determine the distribution that will maximize our expected information gain. The square of the standard deviation is called variance.

We’ll take the same information equation from before and add a constraint for variance. Because the constraint for variance implies the constraint for expected value, we can simplify our constraints a bit and exclude the expected value constraint.

\[\text{information, the quantity we want to maximize: } \\ f(x)=-\int_{-\infty}^\infty p(x) \cdot \log_2p(x)\,dx \\ \text{unity constraint: }g(x)=\int_{-\infty}^\infty p(x)\,dx - 1 = 0 \\ \text{variance constraint: } h(x) = \int_{-\infty}^\infty (x-\mu)^2 \cdot p(x) \, dx - \sigma^2 = 0\]

Let’s try to find where the gradient of f is equivalent to a linear combination of the individual constraint normals:

\[\nabla f = a \cdot \nabla g +b \cdot \nabla h \\ \frac{\partial f}{\partial p(x)} = a \cdot \frac{\partial g}{\partial p(x)} + b \cdot \frac{\partial h}{\partial p(x)}\]

We’re going to calculate the functional derivatives and see if we can isolate p(x):

\[\frac{-1-\ln p(x)}{\ln 2}=a \cdot 1 + b \cdot (x - \mu)^2 \\ -1-\ln p(x) = (\ln 2) \cdot (a + b \cdot (x - \mu)^2)\\ 1 + \ln p(x) = -(\ln 2) \cdot (a + b \cdot (x - \mu)^2) \\ \ln p(x) = -1 -(\ln 2) \cdot (a + b \cdot (x - \mu)^2) \\ \implies p(x) = e^{-1 -(\ln 2) \cdot (a + b \cdot (x - \mu)^2)} \\ p(x) = e^{-1} \cdot 2^{-(a + b\cdot(x-\mu)^2)} \\ p(x) = e^{-1} \cdot 2^{-a} \cdot 2^{-b \cdot (x-\mu)^2}\]

Let’s plug our p(x) into our unity constraint:

\[\int_{-\infty}^\infty p(x)\,dx=1 \implies \int_{-\infty}^\infty e^{-1} \cdot 2^{-a} \cdot 2^{-b \cdot (x-\mu)^2}\,dx=1 \\ e^{-1} \cdot 2^{-a} \cdot b^{-\frac{1}{2}}\cdot (\frac{\pi}{\ln 2})^{\frac{1}{2}} = 1 \\ e^{-1} \cdot 2^{-a} = b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}}\]

Awesome! Let’s use this to refine our p(x) further:

\[p(x) = e^{-1} \cdot 2^{-a} \cdot 2^{-b \cdot (x-\mu)^2} \implies p(x) = b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2}\]

And let’s plug this new, refined expression into our variance constraint:

\[\int_{-\infty}^\infty(x-\mu)^2 \cdot p(x) \, dx = \sigma^2 \implies \int_{-\infty}^\infty(x-\mu)^2 \cdot b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2} \, dx = \sigma^2 \\ \implies (b \cdot 2\ln2)^{-1} = \sigma^2 \implies (b \cdot \ln 2)^{\frac{1}{2}} =\sigma^{-1} \cdot 2^{-\frac{1}{2}} \\ \text{and } b = \frac{1}{2 \sigma^2 \ln 2}\]

If we rewrite our p(x), we can get a better idea of where to plug both of these in:

\[p(x) = b^{\frac{1}{2}}\cdot (\frac{\ln 2}{\pi})^{\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2} = b^{\frac{1}{2}}\cdot (\ln 2)^{\frac{1}{2}} \cdot \pi^{-\frac{1}{2}} \cdot 2^{-b \cdot (x-\mu)^2} \\ =(b \cdot \ln 2)^{\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot 2^{-b \cdot (x-\mu)^2} \\ = \sigma^{-1} \cdot 2^{-\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot 2^{-b \cdot (x-\mu)^2} \\ = \sigma^{-1} \cdot 2^{-\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot 2^{-\frac{(x-\mu)^2}{2 \sigma^2 \ln 2}}\]

With a change of base, we have:

\[p(x) = \sigma^{-1} \cdot 2^{-\frac{1}{2}} \cdot \pi^{-\frac{1}{2}}\cdot e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \\ = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}\]

Let’s pretty it up a bit:

\[p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{ \sigma})^2}\]

Wow, this is the PDF of a Normal Distribution! We finally understand why the Normal Distribution is everywhere: we’ve proven to ourselves that, out of very simple assumptions of just a mean and a variance, the Normal Distribution is the distribution we must choose.

In other words, the Normal Distribution is the maximum entropy distribution for a specified mean and variance. Beautiful!
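To see this in numbers, we can pit the Normal against two other familiar shapes with the same mean and variance and compare entropies. This is a sketch under my own assumptions: the rivals (Laplace and uniform), the crude midpoint-rule integrator, and μ = 0, σ = 1 are illustrative choices. Beating two rivals is not a proof, just a sanity check. Entropy is in nats.

```python
import math


def integrate(f, lo: float, hi: float, steps: int = 200_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


def differential_entropy(pdf, lo: float, hi: float) -> float:
    """-integral of p * ln(p), skipping points where the density is zero."""
    def integrand(x):
        p = pdf(x)
        return -p * math.log(p) if p > 0 else 0.0
    return integrate(integrand, lo, hi)


mu, sigma = 0.0, 1.0  # arbitrary illustrative mean and standard deviation


def normal_pdf(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


laplace_scale = sigma / math.sqrt(2)  # Laplace variance is 2 * scale ** 2


def laplace_pdf(x):
    return math.exp(-abs(x - mu) / laplace_scale) / (2 * laplace_scale)


half_width = sigma * math.sqrt(3)     # uniform variance is (2 * half_width) ** 2 / 12


def uniform_pdf(x):
    return 1 / (2 * half_width) if abs(x - mu) <= half_width else 0.0


lo, hi = mu - 30 * sigma, mu + 30 * sigma
entropies = {
    "normal": differential_entropy(normal_pdf, lo, hi),
    "laplace": differential_entropy(laplace_pdf, lo, hi),
    "uniform": differential_entropy(uniform_pdf, lo, hi),
}
print(max(entropies, key=entropies.get))  # normal
```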

Wrapping It Up

We’ve learned that the true zero knowledge distribution is a uniform distribution, and extensions of this concept of a “zero knowledge” distribution (other than your constraints) yield the exponential distribution when mean-constrained and the Gaussian when variance-constrained.

It’s difficult to find intuitive step-by-step walkthroughs of maximum entropy distributions and I really enjoy this perspective on statistics, so I wanted to share it with the world in the hopes that it might help people understand probability, statistics, and information a little better.