Last time we talked about what a Markov chain is. However there is one big limitation:
A Markov chain implies that we can directly observe the state of the process (e.g. the number of people in the queue).
Many times we can only access an indirect representation or a noisy measurement of the state of the system (e.g. we know the noisy GPS coordinates of a robot but we want to know its real position).

In this post we’re going to focus on this second point and see how Hidden Markov Models (HMMs) deal with it. In fact an HMM can be useful every time we don’t have direct access to the system state. Let’s take some motivating examples before we dig into the maths.
- Noisy coordinates of a subject’s position: we have access to the measured coordinates but we want to know the real position of the subject
- Handwriting recognition: we have the images of characters and want to know the actual letters they represent.
- Financial analysis: we have access to the noisy variations of a stock price but we want to know if we should buy, sell, or hold.
- Remember the gym session example? You don’t know which type of session I am doing; the only thing you can observe is the exercise I am doing, but you want to know the type of session it corresponds to (cardio, strength, …).
- ….
Whatever example we choose, it always boils down to the same kind of diagram, known as a trellis diagram. In fact understanding these diagrams is quite important as most of the properties we’re going to see can be inferred from them.
A Hidden Markov Model (HMM) is defined by
- a set of states \(S = s_1, s_2, …, s_m\)
- a probability distribution for the initial state \(\pi\)
- a transition model \(T(s,s’)\)
- an emission model \(\varepsilon(s,x)\)
- a set of observations \(X = x_1, x_2, … x_n\)
Note:
- Every hidden state \(z_k \in S\), i.e. every \(z_k\) equals one of \(s_1, s_2, …, s_m\).
- \(X\) can take values in any domain \(\chi\) (e.g. \(\chi = \mathbb{R}\) if all the \(x_k\) are real numbers).
- As the state of the system is hidden, we don’t know the initial state, so we assume a probability distribution \(\pi\) over the initial state.
If this is clear, let’s start with the joint distribution:
\(\begin{align}
p(x_1, …, x_n, z_1, …, z_n) = p(x_1|z_1)p(z_1) \prod_{k=2}^n p(x_k|z_k) p(z_k|z_{k-1})
\end{align}
\)
According to the definition of the HMM we have all the information we need to compute this joint distribution:
- \(p(z_1)\) is given by the initial distribution \(\pi(s_i) = p(z_1 = s_i)\)
- \(p(z_k|z_{k-1})\) is given by the transition model \(T(s_i,s_j) = p(z_k=s_i|z_{k-1}=s_j)\)
- \(p(x_k|z_k)\) is given by the emission distribution \(\varepsilon_{s_i}(x) = p(x_k=x|z_k=s_i)\)
This factorisation of the joint distribution is the key to the hidden Markov model. All the inferences we can make are based on this factorisation, as we’re going to see.
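To make this concrete, here is a minimal sketch in Python that evaluates this factorisation on a toy HMM. All the numbers (states, transition and emission probabilities, observation alphabet) are made up for illustration; only the factorised product itself comes from the formula above.

```python
import numpy as np

# A made-up toy HMM, just to illustrate the factorisation.
pi = np.array([0.6, 0.4])                # initial distribution over the 2 states
A = np.array([[0.7, 0.3],                # A[i, j] = p(z_k = s_j | z_{k-1} = s_i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],           # E[i, x] = p(x_k = x | z_k = s_i),
              [0.1, 0.3, 0.6]])          # with a discrete observation alphabet {0, 1, 2}

def joint(z, x):
    """p(x_1, ..., x_n, z_1, ..., z_n) = p(x_1|z_1) p(z_1) * prod_k p(x_k|z_k) p(z_k|z_{k-1})."""
    p = pi[z[0]] * E[z[0], x[0]]
    for k in range(1, len(z)):
        p *= A[z[k - 1], z[k]] * E[z[k], x[k]]
    return p

# Probability of one particular hidden path together with one observed sequence.
print(joint(z=[0, 0, 1], x=[0, 1, 2]))
```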
The forward/backward algorithm
This is the main algorithm for performing inference on a hidden Markov model. Remember that with an HMM we are interested in the real state of the system, but unfortunately we cannot observe it. Instead we observe a variable \(x_k\) that depends on the state \(z_k\). The forward/backward algorithm gives us access to the hidden state given an observed sequence of events \(X\): it allows us to compute the probability of the state at step \(k\) given the whole sequence of observations, \(p(z_k|x_1,…,x_n)\).

As its name suggests, it consists of 2 parts, a forward pass and a backward pass:
- The forward part computes \(p(z_k,x_1, …, x_k)\)
- The backward part computes \(p(x_{k+1},…,x_n|z_k)\)
Finally we have to combine these two parts to get \(p(z_k|x_1,…,x_n)\). The trick is to note that \(p(z_k|x_1,…,x_n)\) is actually proportional to \(p(z_k,x_1,…,x_n)\).
We can write it as:
\(\begin{align}
p(z_k|x_1,…,x_n) \propto p(z_k,x_1,…,x_n) & = p(x_{k+1},…,x_n|z_k,x_1,…,x_k)p(z_k,x_1,…,x_k) \\
& = p(x_{k+1},…,x_n|z_k)p(z_k,x_1,…,x_k)
\end{align}
\)
The first line is simply the chain rule (product rule) of conditional probability, and the second line relies on the fact that \(x_{k+1},…,x_n\) is conditionally independent of \(x_1,…,x_k\) given \(z_k\).
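Since the proportionality constant does not depend on \(z_k\), we can recover the actual posterior by normalising over the \(m\) possible values of \(z_k\):
\(\begin{align}
p(z_k|x_1,…,x_n) = \frac{p(x_{k+1},…,x_n|z_k)\,p(z_k,x_1,…,x_k)}{\sum_{z_k=s_1}^{s_m} p(x_{k+1},…,x_n|z_k)\,p(z_k,x_1,…,x_k)}
\end{align}
\)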
That’s great because those are exactly the 2 parts of the forward/backward algorithm. So let’s start with the forward algorithm.
The forward algorithm
The main idea with a Markov model is to look for a recursion, so that we can propagate information across subsequent stages.
\(\begin{align}
p(z_k,x_1,…,x_k) & = \sum_{z_{k-1}=s_1}^{s_m} p(z_k,z_{k-1},x_1,…,x_k) \\
& = \sum_{z_{k-1}=s_1}^{s_m} p(x_k|z_k,z_{k-1},x_1,…,x_{k-1})p(z_k|z_{k-1},x_1,…,x_{k-1})p(z_{k-1},x_1,…,x_{k-1}) \\
& = \sum_{z_{k-1}=s_1}^{s_m} p(x_k|z_k)p(z_k|z_{k-1})p(z_{k-1},x_1,…,x_{k-1})
\end{align}
\)
How did we get there?
- On line 1 we introduce \(z_{k-1}\) into the expression by summing over all its possible values (marginalisation), which leaves the left-hand side unchanged.
- On line 2 we just use the chain rule to decompose the expression into conditional probabilities.
- On line 3 we use the Markov properties to remove the irrelevant variables from the conditionals.
As you’ve probably already noticed, we even have our recursion defined
\(\begin{align}
\alpha_k(z_k) & = p(z_k,x_1,…,x_k) \\
& = \sum_{z_{k-1}=s_1}^{s_m} p(x_k|z_k)p(z_k|z_{k-1})p(z_{k-1},x_1,…,x_{k-1}) \\
& = \sum_{z_{k-1}=s_1}^{s_m} p(x_k|z_k)p(z_k|z_{k-1})\alpha_{k-1}(z_{k-1})
\end{align}
\)
How do we solve it? First \(p(x_k|z_k)\) is known (emission distribution) and \(p(z_k|z_{k-1})\) is also known (transition model).
We only have to solve the recursion. Let’s start with the initial case
\(\begin{align}
\alpha_1(z_1) & = p(z_1,x_1) \\
& = p(x_1|z_1)p(z_1)
\end{align}
\)
We can now compute \(\alpha_1(z_1)\). From that we can compute \(\alpha_2(z_2)\), then \(\alpha_3(z_3)\), and so on up to \(\alpha_k(z_k)\). You can now see how the computation moves forward from 1 to \(k\), hence the algorithm’s name.
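Here is a minimal sketch of the forward pass in Python, on the same kind of made-up toy HMM as in the earlier snippet; only the recursion itself comes from the formulas above.

```python
import numpy as np

m = 2                                    # number of states
pi = np.array([0.6, 0.4])                # initial distribution
A = np.array([[0.7, 0.3],                # A[i, j] = p(z_k = s_j | z_{k-1} = s_i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],           # E[i, x] = p(x_k = x | z_k = s_i)
              [0.1, 0.3, 0.6]])

def forward(x):
    """Return an (n, m) array whose row k-1 holds alpha_k(s_1), ..., alpha_k(s_m)."""
    n = len(x)
    alpha = np.zeros((n, m))
    alpha[0] = pi * E[:, x[0]]                   # alpha_1(z_1) = p(x_1|z_1) p(z_1)
    for k in range(1, n):
        # alpha_k(z_k) = p(x_k|z_k) * sum_{z_{k-1}} p(z_k|z_{k-1}) alpha_{k-1}(z_{k-1})
        alpha[k] = E[:, x[k]] * (alpha[k - 1] @ A)
    return alpha

alpha = forward([0, 1, 2, 1])
print(alpha[-1].sum())                           # summing alpha_n over z_n gives p(x_1, ..., x_n)
```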
The backward algorithm
The backward pass follows a similar idea: the trick is again to find a recursion, this time to express \(p(x_{k+1},…,x_n|z_k)\).
\(\begin{align}
p(x_{k+1},…,x_n|z_k) & = \sum_{z_{k+1}=s_1}^{s_m} p(x_{k+1}, …, x_n, z_{k+1}|z_k) \\
& = \sum_{z_{k+1}=s_1}^{s_m} p(x_{k+2}, …, x_n|z_{k+1},z_k,x_{k+1}) p(x_{k+1}|z_k,z_{k+1}) p(z_{k+1}|z_k) \\
& = \sum_{z_{k+1}=s_1}^{s_m} p(x_{k+2}, …, x_n|z_{k+1}) p(x_{k+1}|z_{k+1}) p(z_{k+1}|z_k)
\end{align}
\)
- On the first line we introduced \(z_{k+1}\) by summing over all its possible values, as it was not in the equation before.
- On the second line we used the chain rule to condition on \(z_{k+1}\) and \(x_{k+1}\).
- On the third line we used the Markov properties to simplify the conditioning.
This gives us the recursion we were looking for
\(\begin{align}
\beta_k(z_k) & = p(x_{k+1},…,x_n|z_k) \\
& = \sum_{z_{k+1}=s_1}^{s_m} p(x_{k+1}, …, x_n, z_{k+1}|z_k) \\
& = \sum_{z_{k+1}=s_1}^{s_m} p(x_{k+2}, …, x_n|z_{k+1}) p(x_{k+1}|z_{k+1}) p(z_{k+1}|z_k) \\
& = \sum_{z_{k+1}=s_1}^{s_m} \beta_{k+1}(z_{k+1}) p(x_{k+1}|z_{k+1}) p(z_{k+1}|z_k)
\end{align}
\)
We need to compute the limit case for \(n-1\):
\(\begin{align}
\beta_{n-1}(z_{n-1}) & = p(x_n|z_{n-1}) \\
& = \sum_{z_n=s_1}^{s_m} p(x_n, z_n|z_{n-1}) \\
& = \sum_{z_n=s_1}^{s_m} p(x_n|z_n,z_{n-1})p(z_n|z_{n-1}) \\
& = \sum_{z_n=s_1}^{s_m} p(x_n|z_n)p(z_n|z_{n-1})
= \sum_{z_n=s_1}^{s_m} \beta_n(z_n) p(x_n|z_n) p(z_n|z_{n-1})
\end{align}
\)
By looking at the last line we can conclude that \(\beta_n(z_n) = 1\), and from there we can move backward by computing \(\beta_{n-1}(z_{n-1})\), \(\beta_{n-2}(z_{n-2})\), … all the way down to \(\beta_k(z_k)\).
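And a matching sketch of the backward pass, again on a made-up toy HMM. The last lines combine the forward and backward quantities to get the posterior \(p(z_k|x_1,…,x_n)\) by normalising the product \(\alpha_k(z_k)\beta_k(z_k)\), as described above.

```python
import numpy as np

m = 2
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],                # A[i, j] = p(z_k = s_j | z_{k-1} = s_i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],           # E[i, x] = p(x_k = x | z_k = s_i)
              [0.1, 0.3, 0.6]])

def forward(x):
    """Row k-1 holds alpha_k(s_1), ..., alpha_k(s_m)."""
    alpha = np.zeros((len(x), m))
    alpha[0] = pi * E[:, x[0]]
    for k in range(1, len(x)):
        alpha[k] = E[:, x[k]] * (alpha[k - 1] @ A)
    return alpha

def backward(x):
    """Row k-1 holds beta_k(s_1), ..., beta_k(s_m), with beta_n = 1."""
    beta = np.ones((len(x), m))                  # base case: beta_n(z_n) = 1
    for k in range(len(x) - 2, -1, -1):
        # beta_k(z_k) = sum_{z_{k+1}} beta_{k+1}(z_{k+1}) p(x_{k+1}|z_{k+1}) p(z_{k+1}|z_k)
        beta[k] = A @ (E[:, x[k + 1]] * beta[k + 1])
    return beta

x = [0, 1, 2, 1]
gamma = forward(x) * backward(x)                 # proportional to p(z_k | x_1, ..., x_n)
gamma /= gamma.sum(axis=1, keepdims=True)        # normalise each step over the m states
print(gamma)
```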
Complexity
Let’s have a quick look at the complexity of this algorithm. Both the forward and backward passes follow the same pattern and have the same complexity, so we can just analyse the forward pass and draw the same conclusion for the backward part.
The forward algorithm was \(\alpha_k(z_k) = \sum_{z_{k-1}=s_1}^{s_m} p(x_k|z_k)p(z_k|z_{k-1})\alpha_{k-1}(z_{k-1})\).
- How many \(\alpha\) do we have? \(n\)
- For each \(\alpha_k\) how many \(z_k\) do we have? \(m\) (because \(z_k\) takes its values in \(S = s_1, s_2, …, s_m\)).
- For each \(\alpha_k, z_k\) how many terms do we have in the sum? \(m\) again.
So the total complexity is \(\Theta(nm^2)\). This might not seem terrific, but compared against the naive approach, which has a complexity of \(\Theta(m^n)\), it turns out to be pretty impressive.
Let’s take a simple example where we make 100 observations of a system that has 10 different states. With the naive approach we have a complexity of \(\Theta(m^n) = \Theta(10^{100})\), and that’s huge!
With the forward/backward algorithm the complexity is simply \(\Theta(nm^2) = \Theta(100 \times 10^2) = \Theta(10^4)\), an improvement of 96 orders of magnitude.
Why is it so efficient? Because the recursion allows us to re-use what we have already computed at earlier steps instead of recomputing it.
The Viterbi algorithm
The forward/backward algorithm is great for inferring the value of \(z_k\) given a set of observations \(x_1,…,x_n\). However it doesn’t allow us to find the most probable sequence of \(z\).
And that is exactly what the Viterbi algorithm does.
We still assume that the observations \(x\) and the distributions are known. The goal of the algorithm is to find the most likely sequence of \(z\) – let’s call it \(z^*\). We can write it as
\(z^* = \operatorname{argmax}_z p(z|x)
\)
where
\(\begin{align}
z & = z_1,…,z_n \\
x & = x_1,…,x_n
\end{align}
\)
Since \(p(z|x) \propto p(z,x)\) we have \(\operatorname{argmax}_z p(z|x) = \operatorname{argmax}_z p(z,x)\).
Instead of working with the \(\operatorname{argmax}\) we’re going to focus on the \(\max\) (because we can just track for which value of \(z\) we get the \(\max\)).
That means we are now interested in \(\max_z p(x,z)\), and as we did with the forward/backward algorithm we’re going to introduce a recursion. We have \(\max_z p(x,z) = \max_{z_1,…,z_n} p(x_1, …, x_n, z_1,…,z_n)\), and we need to introduce \(z_k\) in order to define that recursion.
\(\begin{align}
\mu_k(z_k) & = \max_{z_1,…,z_{k-1}} p(x_1, …, x_k, z_1,…,z_k) \\
& = \max_{z_1,…,z_{k-1}} p(x_k|x_1,…,x_{k-1},z_1,…,z_k)p(z_k|x_1,…,x_{k-1},z_1,…,z_{k-1})p(x_1,…,x_{k-1},z_1,…,z_{k-1}) \\
& = \max_{z_1,…,z_{k-1}} p(x_k|z_k)p(z_k|z_{k-1})p(x_1,…,x_{k-1},z_1,…,z_{k-1})
\end{align}
\)
As usual, we have used the chain rule to condition and then simplified using the independence properties.
Now we need to move the \(\max\) inside the expression to express \(\mu_k(z_k)\) in terms of \(\mu_{k-1}(z_{k-1})\).
For that we rely on the following property:
\(\begin{align}
\max_{a,b} f(a)g(a,b) = \max_a \left(f(a) \max_b g(a,b)\right) \\
\forall_a f(a) \ge 0 \\
\forall_{a,b} g(a,b) \ge 0
\end{align}
\)
and we can now write
\(\begin{align}
\mu_k(z_k) & = \max_{z_1,…,z_{k-1}} p(x_1, …, x_k, z_1,…,z_k) \\
& = \max_{z_1,…,z_{k-1}} p(x_k|z_k)p(z_k|z_{k-1})p(x_1,…,x_{k-1},z_1,…,z_{k-1}) \\
& = \max_{z_{k-1}} \left(p(x_k|z_k)p(z_k|z_{k-1}) \max_{z_1,…,z_{k-2}} p(x_1,…,x_{k-1},z_1,…,z_{k-1})\right) \\
& = \max_{z_{k-1}} \left(p(x_k|z_k)p(z_k|z_{k-1}) \mu_{k-1}(z_{k-1})\right)
\end{align}
\)
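As a quick numeric sanity check of the max-factorisation property used on the third line (the arrays below are arbitrary non-negative numbers, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random(4)                        # f(a) >= 0
g = rng.random((4, 5))                   # g(a, b) >= 0

lhs = np.max(f[:, None] * g)             # max over (a, b) of f(a) g(a, b)
rhs = np.max(f * g.max(axis=1))          # max over a of f(a) * (max over b of g(a, b))
print(np.isclose(lhs, rhs))              # True
```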
Now we have our recursion, and the base case is \(\mu_1(z_1) = p(z_1,x_1) = p(x_1|z_1)p(z_1)\). Starting from here we can move forward all the way up to \(\mu_n(z_n)\) and we have
\(\begin{align}
\mu_n(z_n) & = \max_{z_1,…,z_{n-1}} p(x_1,…,x_n,z_1,…,z_n) \\
\max_{z_n} \mu_n(z_n) & = \max_{z_1,…,z_n} p(x_1,…,x_n,z_1,…,z_n)
\end{align}
\)
If you look back at the diagram at the beginning of this section you should have a clear idea of how it works. It starts from step 1 and figures out the most likely state given \(x_1\). Then it moves on to step 2 and computes the most probable sequence for \(z_1, z_2\) given \(x_1\) and \(x_2\), and so on. At each step we rely on the sequences computed at the previous step.
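Here is a minimal sketch of the Viterbi recursion in Python, again on a made-up toy HMM. It keeps track, for each \(\mu_k(z_k)\), of which previous state achieved the max, and then backtracks to recover \(z^*\).

```python
import numpy as np

m = 2
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],                # A[i, j] = p(z_k = s_j | z_{k-1} = s_i)
              [0.2, 0.8]])
E = np.array([[0.5, 0.4, 0.1],           # E[i, x] = p(x_k = x | z_k = s_i)
              [0.1, 0.3, 0.6]])

def viterbi(x):
    """Return the most likely sequence of hidden state indices for the observations x."""
    n = len(x)
    mu = np.zeros((n, m))                # row k-1 holds mu_k(s_1), ..., mu_k(s_m)
    back = np.zeros((n, m), dtype=int)   # back[k, j]: best previous state leading to s_j
    mu[0] = pi * E[:, x[0]]              # base case: mu_1(z_1) = p(x_1|z_1) p(z_1)
    for k in range(1, n):
        # scores[i, j] = mu_{k-1}(s_i) p(z_k = s_j | z_{k-1} = s_i) p(x_k | z_k = s_j)
        scores = mu[k - 1][:, None] * A * E[:, x[k]]
        mu[k] = scores.max(axis=0)       # max over the previous state
        back[k] = scores.argmax(axis=0)  # remember which previous state achieved it
    # Backtrack from the best final state to recover z*.
    z = [int(mu[-1].argmax())]
    for k in range(n - 1, 0, -1):
        z.append(int(back[k, z[-1]]))
    return z[::-1]

print(viterbi([0, 1, 2, 1]))             # indices of the most likely hidden state sequence
```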
HMMs are generic enough to have a wide range of applications in many domains. The probability computations might make them look impressive at first, but like the Markov chain, once you get a good intuition of how things work they turn out to be rather accessible.