Probability! On Nature’s Uncertainty: Part 2

Well hello there!

If you landed here by chance, I suggest you read Part 1 before continuing. It’s got some juicy bits!

As we saw in the previous article, probability is a measure of uncertainty. However, that article was limited to a discussion of discrete events or random variables, e.g. picking a blue or red box, an orange or green ball, etc. We now extend this notion to continuous variables such as banana weights, hut temperatures, pressures, the amount of rainfall in the jungle, etc.

Our first tool is the Cumulative Distribution Function!

Cumulative Distribution Functions

Continuous random variables can take on any value in the set of real numbers i.e. \smash{\mathit{X \in \mathbb{R}}}. The problem, however, is that probability lies between 0 and 1.

As a result, we need a function that squashes down the set of real numbers to the interval [0, 1]. That is where the Cumulative Distribution Function (CDF) comes in. For the purposes of this article, the CDF of \mathit{X} will be denoted as \mathit{F_X(x)}.

Any \mathit{F_X(x)} has the following properties:

  1. \mathit{F_X(x) \colon \mathbb{R} \mapsto [0, 1]}
  2. \mathit{F_X(x)} is the \mathit{P(X \leq x)}

It is particularly important to note that the CDF gives the probability that a random variable is less than or equal to a specific value. Popular CDFs are presented in Figure 1 below.
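As a quick sanity check, here is a small Python sketch (Python purely for illustration; the rate parameter lam = 1 is an arbitrary choice) that evaluates the Exponential CDF \mathit{F_X(x) = 1 - e^{-\lambda x}} and confirms both properties above:

```python
import math

def exp_cdf(x, lam=1.0):
    """CDF of an Exponential(lam) random variable: F_X(x) = P(X <= x)."""
    return 0.0 if x < 0 else 1.0 - math.exp(-lam * x)

# Property 1: the CDF maps any real x into [0, 1]
for x in [-5.0, 0.0, 0.5, 3.0, 100.0]:
    assert 0.0 <= exp_cdf(x) <= 1.0

# Property 2: F_X(x) = P(X <= x), so it never decreases as x grows
assert exp_cdf(2.0) >= exp_cdf(1.0)
print(exp_cdf(1.0))  # P(X <= 1) with lam = 1, i.e. 1 - e^{-1}
```

Any other CDF you swap in should pass the same two checks.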

Figure 1: Outlining Common CDFs

The Logistic Sigmoid is an important CDF as it represents a lot of natural processes in the universe. Other examples include the Exponential CDF or the Uniform CDF.

Probability Density Functions

Another way of representing the CDF is through its derivative.

The derivative is called the Probability Density Function (PDF) and it represents our second tool for probability. For the purposes of this article a PDF of \mathit{X} will be denoted as \mathit{f_X(x)}.

Any \mathit{f_X(x)} has the following properties:

  1. \mathit{f_X(x) = \frac{dF_X(x)}{dx}}
  2. \mathit{f_X(x) \geq 0}
  3. \mathit{\int\limits_{-\infty}^\infty f_X(x)dx = 1}

It is important to note that the PDF doesn’t give a probability directly. Its value can exceed 1. However, the integral of the PDF up to any value is at most 1, and corresponds to the CDF at that point.
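To make this concrete, here is a small Python sketch (my own illustration, not from the article’s code) using a Uniform density on [0, 0.5]: its height is 2, comfortably above 1, yet integrating it up to any point never exceeds 1 and matches the CDF there:

```python
# Uniform(0, 0.5) density: constant height 1/(b - a) = 2 on the interval
def uniform_pdf(x, a=0.0, b=0.5):
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a=0.0, b=0.5):
    if x < a: return 0.0
    if x > b: return 1.0
    return (x - a) / (b - a)

print(uniform_pdf(0.25))  # 2.0 -- a density value above 1 is perfectly fine

# Numerically integrate the PDF up to x (midpoint rule) and compare to the CDF
def integrate(f, lo, hi, n=100000):
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

x = 0.3
print(integrate(uniform_pdf, -1.0, x))  # approximately 0.6
print(uniform_cdf(x))                   # exactly 0.6
```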

A PDF only exists when the corresponding CDF is differentiable. Basically, the CDF can’t have any sudden jumps. An example of a CDF with such a jump is the Heaviside step function.

Examples of popular PDFs are the Normal, Uniform and Exponential PDFs. These have all been produced by differentiating their respective CDFs! They are shown below in Figure 2.

Figure 2: Outlining Common PDFs

The Gaussian PDF is extremely important. There is a well known theorem called the Central Limit Theorem which outlines its practicality. It says that if you average enough data points, assuming they are all independent and identically distributed, the distribution of that average will tend to look Gaussian! It’s pretty remarkable.

Figure 3 outlines this phenomenon. First, I took the average of varying lengths of data (5–10000 points), all generated from a Uniform distribution. I then plotted histograms of these averages. You can see that the result is pretty much Gaussian!
Figure 3: Showing the Central Limit Theorem

The MATLAB code I used is presented below. Run it and tweak the distributions. It works for any of them!

close all
clear all

points = logspace(0.69,4,500);        % sample sizes from ~5 up to 10000
mn = zeros(1,length(points));
for i = 1:length(points)
    data = rand(round(points(i)),1);                % draw Uniform(0,1) samples
    mn(i) = sqrt(length(data))*(mean(data)-0.5);    % centred, scaled sample mean
    if i > 2
        histogram(mn(1:i))                          % histogram of the means so far
        title(['No. Points: ' num2str(round(points(i)))])
        grid on
        drawnow
    end
end
Anyway, enough of that. We now know about PDFs and CDFs! Those are two very important concepts in probability.

Cogs of Probability: With Continuous Variables

This section borrows heavily from Part 1.

All of the good stuff we learnt from Part 1 can be used with PDFs and CDFs. So there is nothing new here. It’s just a bunch of equations that are quite analogous to their discrete equivalents.

The Sum Rule can be defined as:

 \mathit{f_X(x) = \int\limits_{-\infty}^\infty f_{X,Y}(x,y)dy}

In this equation the value \mathit{f_{X,Y}(x,y)} is known as the Joint PDF of both \mathit{X} & \mathit{Y}.

The Product Rule can be defined as:

\mathit{f_{X,Y}(x,y) = f_{Y|X}(y|x)*f_X(x)}

The value \mathit{f_{Y|X}(y|x)} is known as the Conditional PDF of \mathit{Y} given \mathit{X}.

Finally, Bayes Theorem can be fully represented as:

\mathit{f_{Y|X}(y|x) = \dfrac{f_{X|Y}(x|y)*f_Y(y)}{\int\limits_{-\infty}^\infty f_{X|Y}(x|y)*f_Y(y)dy}}

These equations might look intimidating, but they are actually very straightforward. A comparison with the equations in Part 1 will emphasise their definitions.
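To see the Sum and Product Rules in action, here is a Python sketch with a toy joint PDF \mathit{f_{X,Y}(x,y) = x + y} on the unit square (my own choice of example; it integrates to 1). The marginal is recovered by numerical integration, and multiplying the conditional by the marginal recovers the joint:

```python
# Toy joint PDF on the unit square: f(x, y) = x + y, which integrates to 1.
def joint(x, y):
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

# Sum Rule: f_X(x) = integral of f(x, y) dy, done here with the midpoint rule.
# Analytically the marginal is x + 0.5.
def marginal_x(x, n=10000):
    h = 1.0 / n
    return sum(joint(x, (i + 0.5) * h) for i in range(n)) * h

# Product Rule rearranged: f_{Y|X}(y|x) = f(x, y) / f_X(x)
def cond_y_given_x(y, x):
    return joint(x, y) / marginal_x(x)

x, y = 0.3, 0.7
print(marginal_x(x))                         # ~0.8, matching x + 0.5
print(cond_y_given_x(y, x) * marginal_x(x))  # recovers joint(0.3, 0.7) = 1.0
```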

An important note is that these quantities, unlike in Part 1, are now functions. That is what sets them apart and makes them vastly more powerful!

Expectation: What do you tend to?

Expectation is another term for expected value. Conceptually, it is the average outcome of an experiment if you run it a large number of times. In terms of probability, it is simply a weighted average of the outcomes of an experiment.

This works because probability is defined as the fraction of event occurrences over the total number of observations! Therefore, if you have a PDF, you don’t need to run a huge number of experiments; just use your PDF to figure out what your result will be, on average.

Assume you have a random variable \mathit{X} that naturally follows a PDF \mathit{f_X(x)}. The expectation of \mathit{X} can be defined as:

\mathit{\mathbb{E}[X] = \int\limits_{-\infty}^\infty x*f_{X}(x)dx}

The experiment in this case is observing \mathit{X}. It could be anything: temperature, pressure, etc. Suppose you know that the temperature in a room follows a Gaussian PDF. You can then use the Expectation formula to figure out what value of \mathit{x} you will observe, on average.

Expectation is a description of location in statistics. Another very common descriptor is variance, which can be defined as:

\mathit{var[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2}

The variance of a random variable describes how spread the various points are from the average value.
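As a quick illustration (a Monte Carlo sketch in Python; the sample size and seed are arbitrary choices of mine), we can estimate both quantities for a Uniform(0, 1) variable, whose exact values are \mathit{\mathbb{E}[X] = 1/2} and \mathit{var[X] = 1/12}:

```python
import random

random.seed(0)

# Draw many Uniform(0, 1) samples and form the two empirical averages
samples = [random.random() for _ in range(200000)]

e_x  = sum(samples) / len(samples)                 # estimate of E[X]
e_x2 = sum(s * s for s in samples) / len(samples)  # estimate of E[X^2]
var  = e_x2 - e_x ** 2                             # var[X] = E[X^2] - (E[X])^2

print(e_x)   # close to 0.5
print(var)   # close to 1/12, roughly 0.0833
```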

What’s the point? Why go through all of this?

Good lord! That was a lot of probability. Well done for sticking with it though. It’s a tough concept to visualise, which makes it tough to understand. I wrote this whole thing and I’m still unsure about some stuff. But anyway, let me motivate the why!

When we run experiments, the results are always corrupted by noise. Noise creeps in from everywhere; sensors, human error, etc. We hate noise. Noise means that we can never be completely sure of our answer. A probabilistic framework allows us to capture all this uncertainty in a numerical way!

In a field like ML, where the focus is data, noise is an implicit characteristic. Understanding probability allows you to wield the power of Machine Learning in the right way. Understanding the potential of the tools you’ve been given is a very powerful thing.

As always, thanks for having a read! Please comment and let me know if I’ve missed anything out or if anything could have been made clearer.

Happy swinging!

Further Resources

Probability: For a more mathematical and rigorous treatment of the concepts in Parts 1 & 2, I’ve found this document from the Stanford CS229 ML course extremely helpful. It motivates things from the perspective of sets followed by random variables, but it’s not too difficult.

Probability Theory: A particularly good blog on Probability and Statistics is by Dan Ma. You can find it here. A quick search for things to do with expectation, Bayes’ theorem, etc., will give you quite a lot of information.

Expected Value: Go over the “Definition” section over here. It goes through a numerical example of Expectation.


Probability! On Nature’s Uncertainty: Part 1

This article is a primer on probability. The aim isn’t to outline an extensive review of Probability Theory, but to provide the reader with the necessary understanding required for ML!

Before we start, forget everything you know about probability. Try and read this with a fresh mind.

Why is Probability Useful?

In ML the aim is to infer useful information from collected data. Real world data is inherently corrupted by noise, which may result from stochastic processes or due to unobserved variability. Therefore, the information inferred is always uncertain.

Probability theory allows us to quantify that uncertainty. It can be seen as a “measure” of uncertainty. Decision theory then allows us to use this probabilistic framework to make optimal choices, even in the face of ambiguity. That is a very powerful thing.

A lot of this discussion has been taken from Pattern Recognition and Machine Learning by Christopher Bishop. The book is fantastic! It is aimed at the more mathematically inclined but regardless, for me, it is one of the best books on ML out there.

Let’s begin!

Ooh The Randomness

First we have to define a random variable. A random variable is a variable whose value is the outcome of an apparently random process. It can be the outcome of any experiment. Imagine the scenario where you have two boxes, shown in the figure below.

Figure 1: Boxes and Balls – Outlining Random Variables

An experiment could be:

  • Choose a box at random
  • Pick a ball at random

The first experiment would lead to either the red or blue box being chosen (random variable \smash{\mathit{X}}). The second experiment would lead to either an orange or green ball being chosen (random variable \smash{\mathit{Y}}). By convention, random variables are uppercase. Specific values of random variables are given as lowercase i.e. \smash{\mathit{X = x}}.

For the boxes and balls example, the random variable can be seen as discrete. It can only take on a finite set of values. This notion can easily be extended to continuous random variables, for example temperature, pressure or height. In that case \smash{\mathit{x \in \mathbb{R}}}.

Now we have an understanding of what a random variable is! Next up, probability!

Cogs of Probability

Probability is a measure. Just like any measure we have, e.g. the meter, the second or the ampere, we have to define probability. For this article, probability is defined as the fraction of times an event \smash{\mathit{A}} occurs out of the total number of observations \smash{\mathit{N}}. Shown below:

\mathit{P(A) = \frac{n_A}{N}}

We also state that \smash{\mathit{0 \leq P(A) \leq 1}}. Therefore, we constrain the values of probability to lie between 0 and 1. A probability of 1 represents the scenario where you are absolutely certain about something.
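This frequency definition can be simulated directly. The Python sketch below (my own illustration; the fair-die event is a hypothetical choice) shows the fraction \mathit{\frac{n_A}{N}} settling near the true probability 1/6 as \mathit{N} grows:

```python
import random

random.seed(1)

# Event A: "a fair die shows a six", with true probability 1/6.
# Count occurrences n_A over N trials and watch n_A / N stabilise.
for n in [100, 10000, 1000000]:
    n_a = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    print(n, n_a / n)
```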

Imagine you have two random variables, \smash{\mathit{X,Y}}, where \smash{\mathit{x,y \in \mathbb{R}}}.

Let the number of times \smash{\mathit{X = x_i}} be \smash{\mathit{c_i}} and the number of times \smash{\mathit{Y = y_j}} be \smash{\mathit{r_j}}. Finally, let the number of times that \smash{\mathit{X = x_i}} & \smash{\mathit{Y = y_j}} occur together be \smash{\mathit{n_{i\, j}}}.

This can be sorted graphically as shown in Figure 2 below.

Figure 2: Sorting Random Variables

According to our definition of probability, we can define the following probabilities:

\mathit{P(X = x_i) = \frac{c_i}{N} ; P(Y = y_j) = \frac{r_j}{N}}\\[2mm] \mathit{P(X = x_i,Y = y_j) = \frac{n_{i\,j}}{N}}\\[2mm] \mathit{P(Y=y_j|X=x_i) = \frac{n_{i\,j}}{c_i}}

The value \smash{\mathit{P(X,Y)}} is known as the Joint Probability and expresses the chance that both \smash{\mathit{X}} & \smash{\mathit{Y}} happen together. Further, the value \smash{\mathit{P(Y|X)}} is known as the Conditional Probability, and refers to the chance of \smash{\mathit{Y}} happening after having observed \smash{\mathit{X}}.

From these definitions we can obtain two very important rules known as the Sum and Product Rules. They are fundamental and are utilised heavily in probabilistic inference.

From Figure 2 we can see that the Sum Rule, also referred to as marginalisation, can be defined as:

\mathit{P(X = x_i) = \sum_j P(X = x_i, Y = y_j) = \sum_j \frac{n_{i\,j}}{N}}

By observation we can also define a Product Rule:

\mathit{P(X,Y) = \frac{n_{i\,j}}{N} = \frac{n_{i\,j}}{c_i} * \frac{c_i}{N} = P(Y|X)*P(X)}
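We can verify both rules with a small table of joint counts \mathit{n_{i\,j}} (the numbers below are made up, just mirroring the layout of Figure 2):

```python
# Made-up joint counts n_ij for X in {x1, x2} and Y in {y1, y2}
n = {('x1', 'y1'): 10, ('x1', 'y2'): 30,
     ('x2', 'y1'): 25, ('x2', 'y2'): 35}
N = sum(n.values())  # total number of observations (100 here)

def p_joint(x, y): return n[(x, y)] / N                        # n_ij / N
def p_x(x):                                                    # Sum Rule: c_i / N
    return sum(n[(x, y)] for y in ('y1', 'y2')) / N
def p_y_given_x(y, x):                                         # n_ij / c_i
    return n[(x, y)] / sum(n[(x, yy)] for yy in ('y1', 'y2'))

# Product Rule: P(X, Y) = P(Y|X) * P(X), exactly, for every cell
for x in ('x1', 'x2'):
    for y in ('y1', 'y2'):
        assert abs(p_joint(x, y) - p_y_given_x(y, x) * p_x(x)) < 1e-12

print(p_x('x1'), p_joint('x1', 'y2'))  # 0.4 and 0.3 with these counts
```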

Using the property of symmetry i.e. \smash{P(X,Y) \equiv P(Y,X)} we can define Bayes Theorem:

\mathit{P(X|Y)*P(Y) = P(Y|X)*P(X)}\\[5 mm] \therefore \mathit{P(Y|X) = \dfrac{P(X|Y)*P(Y)}{P(X)}}

Finally, using the Sum Rule we can express the denominator of Bayes Theorem as elements of the numerator:

\mathit{P(Y|X) = \dfrac{P(X|Y)*P(Y)}{\sum_Y P(X|Y)*P(Y)}}

Bayes Theorem is an extremely useful equation for a bunch of reasons! Specific to Bayes Theorem, the values \smash{\mathit{P(Y|X), P(X|Y)}} and \smash{\mathit{P(Y)}} are known as the posterior, likelihood and prior probabilities respectively. The denominator is known as the normalisation probability.
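A worked example ties this together. The box contents below are hypothetical (they are not taken from the article’s figure): suppose the red box holds 2 orange and 6 green balls, the blue box holds 3 orange and 1 green, and the red box is picked 40% of the time. Bayes Theorem then gives the posterior probability that an observed orange ball came from the red box:

```python
p_box = {'red': 0.4, 'blue': 0.6}               # prior P(Y)
p_orange_given = {'red': 2 / 8, 'blue': 3 / 4}  # likelihood P(X=orange | Y)

# Normalisation: P(orange) = sum over boxes of P(orange|Y) * P(Y)
p_orange = sum(p_orange_given[b] * p_box[b] for b in p_box)

# Posterior via Bayes Theorem: P(Y=red | X=orange)
p_red_given_orange = p_orange_given['red'] * p_box['red'] / p_orange

print(p_orange)            # 0.55
print(p_red_given_orange)  # 0.1 / 0.55, roughly 0.18
```

Seeing the orange ball sharply lowers the belief that the red box was chosen, since orange balls are much more common in the blue box.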

What’s next?

Well, that was a lot of information! Good job on sticking through all of it, you’re on the path to understanding a very powerful concept. As I’m still a student, please comment on things I may have missed out or areas that could be made clearer.

Swing on over to Part 2 for how we can expand these concepts to distributions and densities. Therein lie the fruits of probability theory!

Monkey See. Monkey Do.

Monkey See. Monkey Do.

I’ve always wanted to start a blog, but just never known where to begin or what to write about. So, as you do, I signed up for a WordPress account with the username mlmonkey29. Why mlmonkey29? Don’t ask me, just go with it.

This blog is aimed at improving my, and perhaps other monkeys’, understanding and interest in Machine Learning.

I don’t really know what I’m going to post on here yet, but hopefully it will be fun and we just might figure out the real meaning behind “Monkey See. Monkey Do”. For now, here’s a picture of a monkey – Enjoy!