<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2020-05-07T20:35:07+00:00</updated><id>/feed.xml</id><title type="html">ProxyCausal</title><subtitle>Just another graduate student trying to figure things out</subtitle><entry><title type="html">Understanding the normal distribution</title><link href="/stats/2020/04/15/gaussian.html" rel="alternate" type="text/html" title="Understanding the normal distribution" /><published>2020-04-15T00:00:00+00:00</published><updated>2020-04-15T00:00:00+00:00</updated><id>/stats/2020/04/15/gaussian</id><content type="html" xml:base="/stats/2020/04/15/gaussian.html">&lt;p&gt;I thought it’d be a good idea to start the blog with a probability distribution that appears everywhere 
from heights to blood pressure to SAT scores.
It looked monstrous the first time I saw it and I ended up having to memorize it for tests.&lt;/p&gt;

&lt;p&gt;I mean just look at it!
&lt;script type=&quot;math/tex&quot;&gt;p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Since I hate memorizing, this is my attempt to understand how it is derived.&lt;/p&gt;

&lt;p&gt;Let’s start with discrete distributions, that is, a random variable &lt;script type=&quot;math/tex&quot;&gt;X&lt;/script&gt; that takes a finite number of values.
Imagine that you’re trying to distribute N identical objects among bins such that there are &lt;script type=&quot;math/tex&quot;&gt;n_i&lt;/script&gt; objects in bin i. There are &lt;script type=&quot;math/tex&quot;&gt;N!&lt;/script&gt; ways of placing the objects in order
(N ways to choose the first object, N-1 ways to choose the second, and so on until there’s only one way). But we don’t want to count rearrangements within each bin as distinct.&lt;/p&gt;

&lt;p&gt;In an example with 3 bins and 6 items we treat the first two orderings as one, but the next two orderings as separate.&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1,2,3&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;5,6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1,3,2&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;5,6&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1,2,3&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;5,6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1,3,2&lt;/td&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;6,5&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;To avoid counting these intra-bin permutations, we divide by the number of orderings within each bin.
In general there are &lt;script type=&quot;math/tex&quot;&gt;W = \frac{N!}{\prod_i n_i!}&lt;/script&gt; distinct ways of putting the items into the bins. In our example, &lt;script type=&quot;math/tex&quot;&gt;n_1 = 3&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;n_2 = 1&lt;/script&gt;, &lt;script type=&quot;math/tex&quot;&gt;n_3 = 2&lt;/script&gt;,
and so we get &lt;script type=&quot;math/tex&quot;&gt;\frac{6!}{3!\,1!\,2!} = 60&lt;/script&gt; ways.&lt;/p&gt;
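&lt;p&gt;A quick sanity check in Python (a minimal sketch; the function name is mine):&lt;/p&gt;

```python
from math import factorial

def n_arrangements(counts):
    """Number of distinct ways to fill bins with the given occupancy
    counts: N! divided by the product of the n_i! terms."""
    total = factorial(sum(counts))
    for n in counts:
        total //= factorial(n)
    return total

# The post's example: n_1 = 3, n_2 = 1, n_3 = 2
print(n_arrangements([3, 1, 2]))  # 60
```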

&lt;p&gt;Now say we wanted to find the probability of an item being assigned to bin i, &lt;script type=&quot;math/tex&quot;&gt;p(x_i)&lt;/script&gt;. How would we express this?
Intuitively it’s just the fraction of items in bin i:
&lt;script type=&quot;math/tex&quot;&gt;p(x_i) = \lim_{N\to\infty} \frac{n_i}{N}&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Do these two quantities look related? That’s because they are. We won’t prove it here, but essentially we transform our &lt;script type=&quot;math/tex&quot;&gt;W&lt;/script&gt; into a measure called entropy by taking &lt;script type=&quot;math/tex&quot;&gt;\frac{1}{N}\log W&lt;/script&gt; and applying Stirling’s approximation,
which expresses it as a function of the probability of an item going into each bin. All this means is that instead of using our raw counts, we use a normalized count of items in each bin.&lt;/p&gt;

&lt;p&gt;We end up with this expression:&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;H(p) = -\sum_i p(x_i) \log p(x_i)&lt;/script&gt;

&lt;p&gt;The log base doesn’t matter, but for these examples we will use 2.
Entropy grows with the number of ways of assigning items into bins: high entropy means there are many ways of reaching a particular outcome.&lt;/p&gt;

&lt;p&gt;OK, well what does this have to do with the normal (Gaussian) distribution?
Basically, the normal distribution is the distribution that maximizes entropy among continuous distributions with a fixed variance. But before we get there, let’s find the maximum entropy distribution in the discrete case.&lt;/p&gt;

&lt;p&gt;The entropy for the example is &lt;script type=&quot;math/tex&quot;&gt;-(1/2\cdot-1 + 1/6\cdot-2.58 + 1/3\cdot-1.58) \approx 1.46&lt;/script&gt;.
Let’s try moving one item from bin 3 to bin 1. The entropy becomes &lt;script type=&quot;math/tex&quot;&gt;1.25&lt;/script&gt; (corresponding to 30 ways). This also happens to be the most uneven distribution, so maybe it has something to do with that?
Testing our hypothesis, the most even distribution has an entropy of &lt;script type=&quot;math/tex&quot;&gt;1.58&lt;/script&gt; (corresponding to 90 ways).&lt;/p&gt;
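&lt;p&gt;These entropies, and their relationship to the way counts, are easy to reproduce; here’s a small sketch using exact base-2 logs rather than rounded values:&lt;/p&gt;

```python
from math import log

def entropy(counts):
    """Shannon entropy (base 2) of normalized bin counts."""
    total = sum(counts)
    return -sum((n / total) * log(n / total, 2) for n in counts)

print(round(entropy([3, 1, 2]), 2))  # 1.46 (60 ways)
print(round(entropy([4, 1, 1]), 2))  # 1.25 (30 ways)
print(round(entropy([2, 2, 2]), 2))  # 1.58 (90 ways)
```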

&lt;p&gt;OK, let’s try to generalize this by proving that the uniform distribution is the maximum entropy distribution, using Lagrange multipliers from Calc III.&lt;/p&gt;

&lt;p&gt;So the function we want to maximize is &lt;script type=&quot;math/tex&quot;&gt;f(p_1,...,p_n) = -\sum_i^n p(x_i) \log p(x_i)&lt;/script&gt; subject to the constraint &lt;script type=&quot;math/tex&quot;&gt;g(p_1,...,p_n) = \sum_i^n p(x_i) = 1&lt;/script&gt;.
That is, the probabilities have to sum to one.&lt;/p&gt;

&lt;p&gt;You might be wondering how we can take a derivative with respect to a function. The reason is that the random variable only takes on a finite number of values, so we can
treat each probability as an ordinary variable.&lt;/p&gt;

&lt;p&gt;The Lagrange function is &lt;script type=&quot;math/tex&quot;&gt;L(\bf{p}, \lambda) = f(p_1,...,p_n) - \lambda (g(p_1,...,p_n)-c)&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;We get n equations from &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial f(\bf p)}{\partial p_i} =  -(\log p(x_i)+1)&lt;/script&gt; using the chain rule, while &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial g(\bf p)}{\partial p_i} = 1&lt;/script&gt;, so the constraint term contributes &lt;script type=&quot;math/tex&quot;&gt;-\lambda&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Putting it together &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial L(\bf p, \lambda)}{\partial p_i} = -(\log p(x_i)+1) -\lambda = 0&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Solving gives &lt;script type=&quot;math/tex&quot;&gt;p(x_i) = e^{-1-\lambda}&lt;/script&gt;, a constant, so it must be equal for all values of &lt;script type=&quot;math/tex&quot;&gt;x_i&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;Therefore &lt;script type=&quot;math/tex&quot;&gt;\sum_i p(x_i) = np(x) = 1&lt;/script&gt;, and so &lt;script type=&quot;math/tex&quot;&gt;p(x) = \frac{1}{n}&lt;/script&gt;&lt;/p&gt;
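&lt;p&gt;We can also spot-check this result empirically: sample random distributions and confirm none beats the uniform one. A minimal sketch (the sampler below, normalized exponentials, is one standard way to draw from the probability simplex):&lt;/p&gt;

```python
import random
from math import log

def entropy(probs):
    return -sum(p * log(p, 2) for p in probs if p)

def random_dist(n, rng):
    """A random probability distribution over n outcomes."""
    raw = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

rng = random.Random(0)
n = 5
h_uniform = entropy([1 / n] * n)  # log2(5), about 2.32
h_best = max(entropy(random_dist(n, rng)) for _ in range(1000))
print(max(h_uniform, h_best) == h_uniform)  # True
```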

&lt;p&gt;What about continuous distributions? The natural first guess is the continuous uniform distribution, where X takes on an infinite number of values.
However, we have to be careful about what we say here. It only makes sense to compare distributions with equal variance &lt;script type=&quot;math/tex&quot;&gt;Var(X) = E[(X-E(X))^2]&lt;/script&gt;, which puts a constraint on the space of distributions.&lt;/p&gt;

&lt;p&gt;We can essentially apply the same analysis to continuous distributions. Whereas before we were solving for a finite set of values, we now need to use calculus of variations.
I’ll cover this more in depth in a separate post another time.&lt;/p&gt;

&lt;p&gt;For continuous distributions, we use something of an analogous form called differential entropy.&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;-\int p(x) \log p(x) dx&lt;/script&gt;

&lt;p&gt;Same Lagrange function setup as before, except now there are two constraints.&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Definition of a probability density function: $$ \int p(x) dx = 1 $$&lt;/li&gt;
	&lt;li&gt;Definition of variance: $$ \int p(x) (x-\mu)^2 dx = \sigma^2 $$&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Expanding the square and using the mean &lt;script type=&quot;math/tex&quot;&gt;\int p(x)\,x\,dx = \mu&lt;/script&gt;, we can manipulate (2) into &lt;script type=&quot;math/tex&quot;&gt;\int p(x) x^2 dx = \mu^2 + \sigma^2&lt;/script&gt;&lt;/p&gt;
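&lt;p&gt;This identity is easy to check by simulation with draws from a normal distribution (the specific mu and sigma below are arbitrary):&lt;/p&gt;

```python
import random

# Check that E[X^2] = mu^2 + sigma^2 by sampling (mu, sigma arbitrary).
mu, sigma = 2.0, 1.5
rng = random.Random(42)
n = 200_000
second_moment = sum(rng.gauss(mu, sigma) ** 2 for _ in range(n)) / n
print(round(second_moment, 1))  # close to mu**2 + sigma**2 = 6.25
```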

&lt;script type=&quot;math/tex; mode=display&quot;&gt;L(p, \lambda) = -\int p(x) \log p(x) dx - \lambda_1(\int p(x) dx - 1) - \lambda_2(\int p(x) x^2 dx - (\mu^2 + \sigma^2))&lt;/script&gt;

&lt;p&gt;Our integrals are of the form &lt;script type=&quot;math/tex&quot;&gt;F(y) = \int f(x, y(x), y'(x)) dx&lt;/script&gt; where &lt;script type=&quot;math/tex&quot;&gt;y(x) = p(x)&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;The Euler–Lagrange equation &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial f}{\partial y} - \frac{d}{dx} \frac{\partial f}{\partial y'} = 0&lt;/script&gt; can then be applied directly. There is no &lt;script type=&quot;math/tex&quot;&gt;y'(x)&lt;/script&gt; in our integrand, so only the &lt;script type=&quot;math/tex&quot;&gt;\frac{\partial f}{\partial y}&lt;/script&gt; term survives and we get&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\frac{\partial L(p(x), \lambda)}{\partial p(x)} = -\log p(x) - 1 - \lambda_1 - \lambda_2 x^2 = 0&lt;/script&gt;

&lt;p&gt;Solve for &lt;script type=&quot;math/tex&quot;&gt;p(x)&lt;/script&gt; to get &lt;script type=&quot;math/tex&quot;&gt;p(x) = \exp(-1 - \lambda_1 - \lambda_2 x^2)&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;We can now solve for &lt;script type=&quot;math/tex&quot;&gt;\lambda_1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\lambda_2&lt;/script&gt; from our constraints, plug them in, and rearrange to get the normal distribution
&lt;script type=&quot;math/tex&quot;&gt;p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)&lt;/script&gt;&lt;/p&gt;
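&lt;p&gt;For the curious, here’s a numerical sketch (taking mu = 0 for simplicity) that the resulting multiplier on x², namely 1/(2σ²), together with a normalizing constant of 1/√(2πσ²), does satisfy both constraints:&lt;/p&gt;

```python
from math import exp, pi, sqrt

sigma = 1.5
lam2 = 1 / (2 * sigma ** 2)            # multiplier on x^2
const = 1 / sqrt(2 * pi * sigma ** 2)  # constant absorbing lambda_1

def p(x):
    return const * exp(-lam2 * x ** 2)

# Riemann sum over [-10*sigma, 10*sigma]
dx = 0.001
xs = [i * dx - 10 * sigma for i in range(int(20 * sigma / dx))]
total = sum(p(x) * dx for x in xs)            # constraint (1), ~1
second = sum(p(x) * x ** 2 * dx for x in xs)  # constraint (2), ~sigma^2
print(round(total, 3), round(second, 3))
```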

&lt;p&gt;So out of all the distributions with the same variance &lt;script type=&quot;math/tex&quot;&gt;\sigma^2&lt;/script&gt;, the normal distribution has the highest entropy.&lt;/p&gt;
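&lt;p&gt;As a concrete check on this claim, we can compare the textbook closed-form differential entropies of a few common distributions, each scaled to the same variance:&lt;/p&gt;

```python
from math import e, log, pi, sqrt

sigma = 1.0  # common variance for all three distributions

# Standard closed-form differential entropies (in nats):
h_gaussian = 0.5 * log(2 * pi * e * sigma ** 2)
h_laplace = 1 + log(sqrt(2) * sigma)   # Laplace scale b = sigma / sqrt(2)
h_uniform = log(sqrt(12) * sigma)      # uniform width w = sigma * sqrt(12)

print(max(h_gaussian, h_laplace, h_uniform) == h_gaussian)  # True
```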

&lt;p&gt;To compare distributions of equal variance we can use the &lt;a href=&quot;https://en.wikipedia.org/wiki/Generalized_normal_distribution&quot;&gt;generalized normal distribution&lt;/a&gt;.
We vary &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; while keeping &lt;script type=&quot;math/tex&quot;&gt;\sigma^2&lt;/script&gt; fixed, solving for the scale &lt;script type=&quot;math/tex&quot;&gt;\alpha&lt;/script&gt;; &lt;script type=&quot;math/tex&quot;&gt;\beta = 2&lt;/script&gt; recovers the Gaussian distribution.
You can see in the figure below (in the legend, the first number is &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; and the second is the entropy) that the Gaussian has the highest entropy.&lt;/p&gt;

&lt;div id=&quot;gennorm&quot;&gt;&lt;/div&gt;

&lt;p&gt;What are the implications of being the maximum entropy distribution with finite variance? If we have some data and we don’t know anything about it besides its sample mean and variance, our best bet is the normal distribution, because it assumes the least beyond those two moments: it is the distribution that can be realized in the most ways.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References:&lt;/h3&gt;
&lt;p&gt;Time to give credit where it’s due.&lt;/p&gt;

&lt;p&gt;Much of the bin exposition is taken from Bishop, which in turn comes from Graham Wallis, as I learned from Statistical Rethinking (how much of the knowledge one possesses is original??).
Inspiration for the generalized normal comparison was taken from Statistical Rethinking.&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.&lt;/li&gt;
	&lt;li&gt;McElreath, Richard. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC press, 2020.&lt;/li&gt;
&lt;/ol&gt;</content><author><name></name></author><summary type="html">I thought it’d be a good idea to start the blog with a probability distribution that appears everywhere from distribution of height to blood pressure to SAT scores. It looked monstrous the first time I saw it and I ended up having to memorize it for tests.</summary></entry></feed>