Before we start, forget everything you know about probability. Try to read this with a fresh mind.

#### Why is Probability Useful?

In machine learning (ML), the aim is to infer useful information from collected data. Real-world data is inherently corrupted by noise, which may arise from stochastic processes or from unobserved variability. The information we infer is therefore always uncertain.

Probability theory allows us to quantify that uncertainty; it can be seen as a “measure” of uncertainty. Decision theory then allows us to use this probabilistic framework to make optimal choices, even in the face of ambiguity. That is a very powerful thing.

A lot of this discussion has been taken from Pattern Recognition and Machine Learning by Christopher Bishop. The book is fantastic! It is aimed at the more mathematically inclined but regardless, for me, it is one of the best books on ML out there.

Let’s begin!

#### Ooh The Randomness

First we have to define a random variable. A random variable is a variable whose value is the outcome of an apparently random process. It can be the outcome of any experiment. Imagine the scenario where you have two boxes, shown in the figure below.

An experiment could be:

• Choose a box at random
• Pick a ball at random

The first experiment results in either the red or the blue box being chosen (random variable $\smash{\mathit{X}}$). The second results in either an orange or a green ball being chosen (random variable $\smash{\mathit{Y}}$). By convention, random variables are written in uppercase and the specific values they take in lowercase, e.g. $\smash{\mathit{X = x}}$.

For the boxes and balls example, the random variables are discrete: each can only take on a finite set of values. This notion can easily be extended to continuous random variables, for example temperature, pressure or height. In that case $\smash{\mathit{x\: \in\: \mathbb{R}}}$.
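To make the two-step experiment concrete, here is a minimal sketch of it in Python. The box contents are an assumption on my part (the figure is not reproduced here), so treat the counts in `BOXES` and the helper `run_experiment` as hypothetical and purely illustrative.

```python
import random

# Hypothetical box contents: these counts are assumptions chosen
# only for illustration, not values taken from the figure.
BOXES = {
    "red": {"orange": 6, "green": 2},
    "blue": {"orange": 1, "green": 3},
}


def run_experiment(rng):
    """Return one draw (x, y): the box chosen and the ball colour picked."""
    # X: choose a box at random (each box equally likely here).
    x = rng.choice(list(BOXES))
    # Y: pick a ball at random from that box, weighted by the ball counts.
    colours = list(BOXES[x])
    weights = [BOXES[x][colour] for colour in colours]
    y = rng.choices(colours, weights=weights, k=1)[0]
    return x, y


rng = random.Random(0)  # seeded so the run is repeatable
samples = [run_experiment(rng) for _ in range(5)]
print(samples)
```

Each element of `samples` is one observed pair $(x, y)$, i.e. one specific value of each of the random variables $X$ and $Y$.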

Now we have an understanding of what a random variable is! Next up, probability!

#### Cogs of Probability

Probability is a measure. Just as with any other measure, e.g. the metre, the second or the ampere, we first have to define it. For this article, the probability of an event $\smash{\mathit{A}}$ is defined as the fraction of times $\smash{\mathit{n_A}}$ that $\smash{\mathit{A}}$ occurs out of the total number of trials $\smash{\mathit{N}}$ (strictly speaking, in the limit where $\smash{\mathit{N}}$ becomes very large):

$\mathit{P(A) = \frac{n_A}{N}}$

We also state that $\smash{\mathit{1 \geq P(A) \geq 0}}$, i.e. probabilities are constrained to lie between 0 and 1. A probability of 1 represents the scenario where you are absolutely certain that the event occurs, and a probability of 0 the scenario where you are certain that it does not.
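The frequency definition can be tried out directly by simulation. The sketch below is only an illustration under my own assumptions: the "experiment" is a fair six-sided die, the event $A$ is "the roll is even", and the names `relative_frequency`, `roll_die` and `is_even` are hypothetical helpers.

```python
import random


def roll_die(rng):
    """One trial: roll a fair six-sided die."""
    return rng.randint(1, 6)


def is_even(outcome):
    """The event A: the roll is an even number."""
    return outcome % 2 == 0


def relative_frequency(event, trial, n_trials, rng):
    """Estimate P(A) as n_A / N over n_trials repetitions of the trial."""
    n_a = sum(1 for _ in range(n_trials) if event(trial(rng)))
    return n_a / n_trials


rng = random.Random(42)  # seeded so the run is repeatable
for n_trials in (10, 1000, 100000):
    print(n_trials, relative_frequency(is_even, roll_die, n_trials, rng))
```

With only a handful of trials the estimate can be far from one half; as the number of trials grows it should settle close to it, which is the sense in which the definition relies on a large $N$.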

Imagine you have two random variables, $\smash{\mathit{X,Y}}$, which take the values $\smash{\mathit{x_i}}$ and $\smash{\mathit{y_j}}$ respectively, where $\smash{\mathit{x_i,y_j\: \in\: \mathbb{R}}}$. Suppose we run $\smash{\mathit{N}}$ trials in total and record both variables each time.

Let the number of times $\smash{\mathit{X = x_i}}$ be $\smash{\mathit{c_i}}$ and the number of times $\smash{\mathit{Y = y_j}}$ be $\smash{\mathit{r_j}}$. Finally, let the number of times that $\smash{\mathit{X = x_i}}$ and $\smash{\mathit{Y = y_j}}$ occur together be $\smash{\mathit{n_{i\, j}}}$.

This can be represented graphically as shown in Figure 2 below.

According to our definition of probability, we can define the following probabilities:

$\mathit{P(X = x_i) = \frac{c_i}{N} ; P(Y = y_j) = \frac{r_j}{N}}\\[2mm] \mathit{P(X = x_i,Y = y_j) = \frac{n_{i\,j}}{N}}\\[2 mm] \mathit{P(Y=y_j|X=x_i) = \frac{n_{i\,j}}{c_i}}$

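The value $\smash{\mathit{P(X,Y)}}$ is known as the Joint Probability and expresses the chance that $\smash{\mathit{X}}$ and $\smash{\mathit{Y}}$ take particular values together. Further, the value $\smash{\mathit{P(Y|X)}}$ is known as the Conditional Probability, and refers to the chance of $\smash{\mathit{Y}}$ taking a particular value given that $\smash{\mathit{X}}$ has been observed. Note that the conditional probability divides by $\smash{\mathit{c_i}}$ rather than by the total number of trials, because we only look at the trials in which $\smash{\mathit{X = x_i}}$.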
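As a rough sketch of how these definitions turn counts into probabilities, the snippet below starts from a small table of counts. The table itself is made up for illustration, and `n[i][j]` is my own storage convention (first index for $X$, second for $Y$); exact fractions are used so that no rounding gets in the way.

```python
from fractions import Fraction

# Hypothetical count table: n[i][j] is the number of trials in which
# X = x_i and Y = y_j occurred together. The values are made up.
n = [
    [6, 2],  # X = x_0
    [1, 3],  # X = x_1
]

N = sum(sum(row) for row in n)  # total number of trials
c = [sum(row) for row in n]  # c_i: number of trials with X = x_i
r = [sum(row[j] for row in n) for j in range(len(n[0]))]  # r_j: trials with Y = y_j

# Marginal probabilities P(X = x_i) = c_i / N and P(Y = y_j) = r_j / N.
p_x = [Fraction(c_i, N) for c_i in c]
p_y = [Fraction(r_j, N) for r_j in r]

# Joint probability P(X = x_i, Y = y_j) = n_ij / N.
p_xy = [[Fraction(n_ij, N) for n_ij in row] for row in n]

# Conditional probability P(Y = y_j | X = x_i) = n_ij / c_i.
p_y_given_x = [[Fraction(n_ij, c[i]) for n_ij in row] for i, row in enumerate(n)]

print("P(X):", p_x)
print("P(Y):", p_y)
print("P(X, Y):", p_xy)
print("P(Y | X):", p_y_given_x)
```

Each row of `p_y_given_x` sums to 1, because once we know $X = x_i$ the variable $Y$ must still take one of its values.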

From these definitions we can obtain two very important rules known as the Sum and Product Rules. They are fundamental and are utilised heavily in probabilistic inference.

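From Figure 2 we can see that $\smash{\mathit{c_i = \sum_j n_{i\,j}}}$: the count for $\smash{\mathit{X = x_i}}$ is the sum of the joint counts over every value of $\smash{\mathit{Y}}$. This gives the Sum Rule, also referred to as marginalisation:

$\mathit{P(X) = \sum_Y P(X,Y) = \sum_j \frac{n_{i\,j}}{N} = \frac{c_i}{N}}$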

By observation we can also define a Product Rule:

$\mathit{P(X,Y) = \frac{n_{i\,j}}{N} = \frac{n_{i\,j}}{c_i} * \frac{c_i}{N} = P(Y|X)*P(X)}$
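Both rules are easy to check numerically. The following is a small sketch on a made-up joint table (the numbers and the `joint[i][j]` layout are my assumptions); since the conditional is itself computed from the joint, the product rule check is really a sanity check on the algebra above rather than an independent test.

```python
from fractions import Fraction

# Hypothetical joint distribution: joint[i][j] = P(X = x_i, Y = y_j).
# The numbers are made up for illustration and sum to 1.
joint = [
    [Fraction(6, 12), Fraction(2, 12)],
    [Fraction(1, 12), Fraction(3, 12)],
]

# Sum Rule: P(X = x_i) = sum over j of P(X = x_i, Y = y_j).
p_x = [sum(row) for row in joint]

# Conditional P(Y = y_j | X = x_i), obtained by dividing each row by P(X = x_i).
p_y_given_x = [[p / p_x[i] for p in row] for i, row in enumerate(joint)]

# Product Rule: P(X, Y) = P(Y | X) * P(X) should rebuild the joint table.
rebuilt = [
    [p_y_given_x[i][j] * p_x[i] for j in range(len(joint[i]))]
    for i in range(len(joint))
]

print("P(X):", p_x)
print("Product rule rebuilds the joint:", rebuilt == joint)
```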

Using the symmetry property $\smash{P(X,Y) \equiv P(Y,X)}$ and applying the Product Rule to both sides, we obtain Bayes' Theorem:

$\mathit{P(X|Y)*P(Y) = P(Y|X)*P(X)}\\[5 mm] \therefore \mathit{P(Y|X) = \dfrac{P(X|Y)*P(Y)}{P(X)}}$

Finally, using the Sum Rule followed by the Product Rule, we can express the denominator of Bayes' Theorem in terms of the quantities that appear in the numerator:

$\mathit{P(Y|X) = \dfrac{P(X|Y)*P(Y)}{\sum_Y P(X|Y)*P(Y)}}$

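Bayes' Theorem is an extremely useful equation: it tells us how to update our belief about $\smash{\mathit{Y}}$ once $\smash{\mathit{X}}$ has been observed. In Bayes' Theorem, the values $\smash{\mathit{P(Y|X), P(X|Y)}}$ and $\smash{\mathit{P(Y)}}$ are known as the posterior, likelihood and prior probabilities respectively. The denominator is a normalisation constant (often called the evidence), which ensures that the posterior sums to 1 over all values of $\smash{\mathit{Y}}$.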
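Here is a minimal sketch of Bayes' Theorem at work. To match the form of the theorem above, $Y$ plays the role of the unknown quantity (which box was chosen) and $X$ the observed one (the ball colour); note that this swaps the labels used in the boxes example earlier. The function `posterior` and all of the numbers are hypothetical, chosen only to illustrate the calculation.

```python
from fractions import Fraction


def posterior(prior, likelihood, x):
    """Return P(Y = y | X = x) for every y, using Bayes' theorem.

    prior:      dict mapping y -> P(Y = y)
    likelihood: dict mapping y -> dict mapping x -> P(X = x | Y = y)
    """
    # Numerator for each y: likelihood times prior.
    unnormalised = {y: likelihood[y][x] * prior[y] for y in prior}
    # Denominator via the Sum Rule: P(X = x) = sum_y P(X = x | Y = y) P(Y = y).
    evidence = sum(unnormalised.values())
    return {y: p / evidence for y, p in unnormalised.items()}


# Hypothetical prior over boxes and likelihood of each ball colour per box.
prior = {"red": Fraction(2, 5), "blue": Fraction(3, 5)}
likelihood = {
    "red": {"orange": Fraction(3, 4), "green": Fraction(1, 4)},
    "blue": {"orange": Fraction(1, 4), "green": Fraction(3, 4)},
}

print(posterior(prior, likelihood, "orange"))
```

With these made-up numbers, observing an orange ball raises the probability that the red box was chosen from the prior value of 2/5 to a posterior value of 2/3.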

#### What’s next?

Well, that was a lot of information! Good job on sticking through all of it; you’re on the path to understanding a very powerful concept. As I’m still a student, please comment on things I may have missed out or areas that could be made clearer.

Swing on over to Part 2 to see how we can extend these concepts to distributions and densities. Therein lie the fruits of probability theory!