<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2020-05-07T20:35:07+00:00</updated><id>/feed.xml</id><title type="html">ProxyCausal</title><subtitle>Just another graduate student trying to figure things out</subtitle><entry><title type="html">Understanding the normal distribution</title><link href="/stats/2020/04/15/gaussian.html" rel="alternate" type="text/html" title="Understanding the normal distribution" /><published>2020-04-15T00:00:00+00:00</published><updated>2020-04-15T00:00:00+00:00</updated><id>/stats/2020/04/15/gaussian</id><content type="html" xml:base="/stats/2020/04/15/gaussian.html">&lt;p&gt;I thought it’d be a good idea to start the blog with a probability distribution that appears everywhere 
from heights to blood pressure to SAT scores.
It looked monstrous the first time I saw it and I ended up having to memorize it for tests.&lt;/p&gt;

&lt;p&gt;I mean just look at it!
&lt;script type=&quot;math/tex&quot;&gt;p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Since I hate memorizing, this is my attempt to understand how it is derived.&lt;/p&gt;

&lt;p&gt;Let’s start with discrete distributions, that is, a random variable &lt;script type=&quot;math/tex&quot;&gt;X&lt;/script&gt; that takes a finite number of values.
Imagine that you’re trying to distribute N identical objects among bins such that there are &lt;script type=&quot;math/tex&quot;&gt;n_i&lt;/script&gt; objects in bin i. There are &lt;script type=&quot;math/tex&quot;&gt;N!&lt;/script&gt; ways of placing the objects in order
(N ways to choose the first object, N-1 ways to choose the second, and so on until there’s only one way). But we don’t want to count rearrangements within each bin as distinct.&lt;/p&gt;

&lt;p&gt;In an example with 3 bins and 6 items we treat the first two orderings as one, but the next two orderings as separate.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1,2,3&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;5,6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1,3,2&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;5,6&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1,2,3&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;5,6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1,3,2&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;6,5&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;To avoid counting these intra-bin permutations, we divide by the number of orderings within each bin.
In general there are &lt;script type=&quot;math/tex&quot;&gt;W = \frac{N!}{\prod_i n_i!}&lt;/script&gt; distinct ways of putting the items into the bins. In our example, &lt;script type=&quot;math/tex&quot;&gt;n_1 = 3&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;n_2 = 1&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;n_3 = 2&lt;/script&gt;,
and so we get &lt;script type=&quot;math/tex&quot;&gt;\frac{6!}{3!\,1!\,2!} = 60&lt;/script&gt; ways.&lt;/p&gt;
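&lt;p&gt;A quick sanity check in Python (a minimal sketch; the function name is mine):&lt;/p&gt;

```python
from math import factorial

def n_arrangements(counts):
    """Number of distinct ways to fill bins with the given occupancy
    counts: N! divided by the product of the n_i! terms."""
    total = factorial(sum(counts))
    for n in counts:
        total //= factorial(n)
    return total

# The post's example: n_1 = 3, n_2 = 1, n_3 = 2
print(n_arrangements([3, 1, 2]))  # 60
```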

&lt;p&gt;Now say we wanted to find the probability of an item being assigned to bin i, &lt;script type=&quot;math/tex&quot;&gt;p(x_i)&lt;/script&gt;. How would we express this?
Intuitively it’s just the fraction of items in bin i:
&lt;script type=&quot;math/tex&quot;&gt;p(x_i) = \lim_{N\to\infty} \frac{n_i}{N}&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Do these two quantities look related? That’s because they are. We won’t prove it here, but essentially we transform our &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; into a measure called entropy by taking &lt;script type=&quot;math/tex&quot;&gt;\frac{1}{N}\log W&lt;/script&gt; and applying Stirling’s approximation,
which expresses it as a function of the probability of an item going into each bin. All this means is that instead of using our raw counts, we use a normalized count of items in each bin.&lt;/p&gt;

&lt;p&gt;We end up with this expression:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;H(p) = -\sum_i p(x_i) \log p(x_i)&lt;/script&gt;

&lt;p&gt;The log base doesn’t matter, but for these examples we will use 2.
Entropy grows with the number of ways of assigning items into bins: high entropy means there are many ways of reaching a particular outcome.&lt;/p&gt;

&lt;p&gt;OK, well what does this have to do with the normal (Gaussian) distribution?
Basically, the normal distribution is the distribution that maximizes entropy among continuous distributions with a fixed variance. But before we get there, let’s find the maximum entropy distribution in the discrete case.&lt;/p&gt;

&lt;p&gt;The entropy for the example is &lt;script type=&quot;math/tex&quot;&gt;-(1/2\cdot-1 + 1/6\cdot-2.58 + 1/3\cdot-1.58) \approx 1.46&lt;/script&gt;.
Let’s try moving one item from bin 3 to bin 1. The entropy becomes &lt;script type=&quot;math/tex&quot;&gt;1.25&lt;/script&gt; (corresponding to 30 ways). This also happens to be the most uneven distribution, so maybe it has something to do with that?
Testing our hypothesis, the most even distribution has an entropy of &lt;script type=&quot;math/tex&quot;&gt;1.58&lt;/script&gt; (corresponding to 90 ways).&lt;/p&gt;
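&lt;p&gt;These entropies, and their relationship to the way counts, are easy to reproduce; here’s a small sketch using exact base-2 logs rather than rounded values:&lt;/p&gt;

```python
from math import log

def entropy(counts):
    """Shannon entropy (base 2) of normalized bin counts."""
    total = sum(counts)
    return -sum((n / total) * log(n / total, 2) for n in counts)

print(round(entropy([3, 1, 2]), 2))  # 1.46 (60 ways)
print(round(entropy([4, 1, 1]), 2))  # 1.25 (30 ways)
print(round(entropy([2, 2, 2]), 2))  # 1.58 (90 ways)
```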

&lt;p&gt;OK, let’s try to generalize this by proving that the uniform distribution is the maximum entropy distribution, using Lagrange multipliers from Calc III.&lt;/p&gt;

&lt;p&gt;So the function we want to maximize is &lt;script type=&quot;math/tex&quot;&gt;f(p_1,...,p_n) = -\sum_i^n p(x_i) \log p(x_i)&lt;/script&gt; subject to the constraint &lt;script type=&quot;math/tex&quot;&gt;g(p_1,...,p_n) = \sum_i^n p(x_i) = 1&lt;/script&gt;.
That is, the probabilities have to sum to one.&lt;/p&gt;

&lt;p&gt;You might be wondering how we can take a derivative with respect to a function. The reason is that the random variable only takes on a finite number of values, so we can
treat each probability as an ordinary variable.&lt;/p&gt;

&lt;p&gt;The Lagrange function is &lt;script type=&quot;math/tex&quot;&gt;L(\bf{p}, \lambda) = f(p_1,...,p_n) - \lambda (g(p_1,...,p_n)-c)&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;We get n equations from &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial f(\bf p)}{\partial p_i} =  -(\log p(x_i)+1)&lt;/script&gt; using the chain rule, while &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial g(\bf p)}{\partial p_i} = 1&lt;/script&gt;, so the constraint term contributes &lt;script type=&quot;math/tex&quot;&gt;-\lambda&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Putting it together &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial L(\bf p, \lambda)}{\partial p_i} = -(\log p(x_i)+1) -\lambda = 0&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Solving gives &lt;script type=&quot;math/tex&quot;&gt;p(x_i) = e^{-1-\lambda}&lt;/script&gt;, a constant, so it must be equal for all values of &lt;script type=&quot;math/tex&quot;&gt;x_i&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Therefore &lt;script type=&quot;math/tex&quot;&gt;\sum_i p(x_i) = np(x) = 1&lt;/script&gt;, and so &lt;script type=&quot;math/tex&quot;&gt;p(x) = \frac{1}{n}&lt;/script&gt;&lt;/p&gt;
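&lt;p&gt;We can also spot-check this result empirically: sample random distributions and confirm none beats the uniform one. A minimal sketch (the sampler below, normalized exponentials, is one standard way to draw from the probability simplex):&lt;/p&gt;

```python
import random
from math import log

def entropy(probs):
    return -sum(p * log(p, 2) for p in probs if p)

def random_dist(n, rng):
    """A random probability distribution over n outcomes."""
    raw = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

rng = random.Random(0)
n = 5
h_uniform = entropy([1 / n] * n)  # log2(5), about 2.32
h_best = max(entropy(random_dist(n, rng)) for _ in range(1000))
print(max(h_uniform, h_best) == h_uniform)  # True
```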

&lt;p&gt;What about continuous distributions? The natural first guess is the continuous uniform distribution, where X takes on an infinite number of values.
However, we have to be careful about what we say here. It only makes sense to compare distributions with equal variance &lt;script type=&quot;math/tex&quot;&gt;Var(X) = E[(X-E(X))^2]&lt;/script&gt;, which puts a constraint on the space of distributions.&lt;/p&gt;

&lt;p&gt;We can essentially apply the same analysis to continuous distributions. Whereas before we were solving for a finite set of values, we now need to use calculus of variations.
I’ll cover this more in depth in a separate post another time.&lt;/p&gt;

&lt;p&gt;For continuous distributions, we use something of an analogous form called differential entropy.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;-\int p(x) \log p(x) dx&lt;/script&gt;

&lt;p&gt;Same Lagrange function setup as before, except now there are two constraints.&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Definition of a probability density function: $$ \int p(x) dx = 1 $$&lt;/li&gt;
	&lt;li&gt;Definition of variance: $$ \int p(x) (x-\mu)^2 dx = \sigma^2 $$&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Expanding the square and using the mean &lt;script type=&quot;math/tex&quot;&gt;\int p(x)\,x\,dx = \mu&lt;/script&gt;, we can manipulate (2) into &lt;script type=&quot;math/tex&quot;&gt;\int p(x) x^2 dx = \mu^2 + \sigma^2&lt;/script&gt;&lt;/p&gt;
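&lt;p&gt;This identity is easy to check by simulation with draws from a normal distribution (the specific mu and sigma below are arbitrary):&lt;/p&gt;

```python
import random

# Check that E[X^2] = mu^2 + sigma^2 by sampling (mu, sigma arbitrary).
mu, sigma = 2.0, 1.5
rng = random.Random(42)
n = 200_000
second_moment = sum(rng.gauss(mu, sigma) ** 2 for _ in range(n)) / n
print(round(second_moment, 1))  # close to mu**2 + sigma**2 = 6.25
```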

&lt;script type=&quot;math/tex; mode=display&quot;&gt;L(p, \lambda) = -\int p(x) \log p(x) dx - \lambda_1(\int p(x) dx - 1) - \lambda_2(\int p(x) x^2 dx - (\mu^2 + \sigma^2))&lt;/script&gt;

&lt;p&gt;Our integrals are of the form &lt;script type=&quot;math/tex&quot;&gt;F(y) = \int f(x, y(x), y'(x)) dx&lt;/script&gt; where &lt;script type=&quot;math/tex&quot;&gt;y(x) = p(x)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The Euler–Lagrange equation &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial f}{\partial y} - \frac{d}{dx} \frac{\partial f}{\partial y'} = 0&lt;/script&gt; can then be applied directly. There is no &lt;script type=&quot;math/tex&quot;&gt;y'(x)&lt;/script&gt; in our integrand, so only the &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial f}{\partial y}&lt;/script&gt; term survives and we get&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial L(p(x), \lambda)}{\partial p(x)} = -\log p(x) - 1 - \lambda_1 - \lambda_2 x^2 = 0&lt;/script&gt;

&lt;p&gt;Solve for &lt;script type=&quot;math/tex&quot;&gt;p(x)&lt;/script&gt; to get &lt;script type=&quot;math/tex&quot;&gt;p(x) = \exp(-1 - \lambda_1 - \lambda_2 x^2)&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;We can now solve for &lt;script type=&quot;math/tex&quot;&gt;\lambda_1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\lambda_2&lt;/script&gt; from our constraints, plug them in, and rearrange to get the normal distribution
&lt;script type=&quot;math/tex&quot;&gt;p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)&lt;/script&gt;&lt;/p&gt;
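&lt;p&gt;For the curious, here’s a numerical sketch (taking mu = 0 for simplicity) that the resulting multiplier on x², namely 1/(2σ²), together with a normalizing constant of 1/√(2πσ²), does satisfy both constraints:&lt;/p&gt;

```python
from math import exp, pi, sqrt

sigma = 1.5
lam2 = 1 / (2 * sigma ** 2)            # multiplier on x^2
const = 1 / sqrt(2 * pi * sigma ** 2)  # constant absorbing lambda_1

def p(x):
    return const * exp(-lam2 * x ** 2)

# Riemann sum over [-10*sigma, 10*sigma]
dx = 0.001
xs = [i * dx - 10 * sigma for i in range(int(20 * sigma / dx))]
total = sum(p(x) * dx for x in xs)            # constraint (1), ~1
second = sum(p(x) * x ** 2 * dx for x in xs)  # constraint (2), ~sigma^2
print(round(total, 3), round(second, 3))
```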

&lt;p&gt;So out of all the distributions with the same variance &lt;script type=&quot;math/tex&quot;&gt;\sigma^2&lt;/script&gt;, the normal distribution has the highest entropy.&lt;/p&gt;
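&lt;p&gt;As a concrete check on this claim, we can compare the textbook closed-form differential entropies of a few common distributions, each scaled to the same variance:&lt;/p&gt;

```python
from math import e, log, pi, sqrt

sigma = 1.0  # common variance for all three distributions

# Standard closed-form differential entropies (in nats):
h_gaussian = 0.5 * log(2 * pi * e * sigma ** 2)
h_laplace = 1 + log(sqrt(2) * sigma)   # Laplace scale b = sigma / sqrt(2)
h_uniform = log(sqrt(12) * sigma)      # uniform width w = sigma * sqrt(12)

print(max(h_gaussian, h_laplace, h_uniform) == h_gaussian)  # True
```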

&lt;p&gt;To compare distributions of equal variance we can use the &lt;a href=&quot;https://en.wikipedia.org/wiki/Generalized_normal_distribution&quot;&gt;generalized normal distribution&lt;/a&gt;.
We vary &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; while keeping &lt;script type=&quot;math/tex&quot;&gt;\sigma^2&lt;/script&gt; fixed, solving for the scale &lt;script type=&quot;math/tex&quot;&gt;\alpha&lt;/script&gt;; &lt;script type=&quot;math/tex&quot;&gt;\beta = 2&lt;/script&gt; recovers the Gaussian distribution.
You can see in the figure below (in the legend, the first number is &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; and the second is the entropy) that the Gaussian has the highest entropy.&lt;/p&gt;

&lt;div id=&quot;gennorm&quot;&gt;&lt;/div&gt;

&lt;p&gt;What are the implications of being the maximum entropy distribution with finite variance? If we have some data and we don’t know anything about it besides its sample mean and variance, our best bet is the normal distribution, because it assumes the least beyond those two moments: it is the distribution that can be realized in the most ways.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References:&lt;/h3&gt;
&lt;p&gt;Time to give credit where it’s due.&lt;/p&gt;

&lt;p&gt;Much of the bin exposition is taken from Bishop, which in turn comes from Graham Wallis, as I learned from Statistical Rethinking (how much of the knowledge one possesses is original??).
Inspiration for the generalized normal comparison was taken from Statistical Rethinking.&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.&lt;/li&gt;
	&lt;li&gt;McElreath, Richard. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC press, 2020.&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">I thought it’d be a good idea to start the blog with a probability distribution that appears everywhere from distribution of height to blood pressure to SAT scores. It looked monstrous the first time I saw it and I ended up having to memorize it for tests.</summary></entry></feed>