Hidden Markov Models (HMMs) — Detailed Overview

1. What is a Hidden Markov Model (HMM)?

A Hidden Markov Model (HMM) is a statistical model used to represent systems governed by a sequence of hidden states that cannot be directly observed (i.e., they are "hidden"), while the outputs or observations produced by the system are influenced by these hidden states.

An HMM assumes the following properties:

  1. Markov Property (Memorylessness): The state of the system at time t depends only on the state at time t-1, i.e., P(s_t | s_{t-1}, s_{t-2}, \dots) = P(s_t | s_{t-1}).
  2. Hidden States: The system has a set of hidden states that cannot be directly observed.
  3. Observable Outputs: The system produces an observation (output) at each time step, which is probabilistically related to the hidden state.
  4. State Transitions: The system transitions between these hidden states based on certain transition probabilities.

2. Components of an HMM

An HMM is typically represented by the following components:

  1. States (S):

    • These are the hidden states in the system.
    • The set of hidden states is typically denoted as S = \{ s_1, s_2, \dots, s_N \}, where N is the number of states.
    • The states themselves are not directly observable (hence "hidden").
  2. Observations (O):

    • At each time step t, an observation o_t is made.
    • Observations are usually denoted by O = \{ o_1, o_2, \dots, o_T \}, where T is the length of the observation sequence.
    • The observations are directly influenced by the hidden states and are observable, unlike the states.
  3. State Transition Probabilities (A):

    • These describe the probability of transitioning from one state to another. The state transition probability matrix A is defined as: a_{ij} = P(s_t = s_j \,|\, s_{t-1} = s_i)
    • The value of a_{ij} represents the probability of moving from state s_i at time t-1 to state s_j at time t.
    • These transition probabilities are stored in a matrix A of size N \times N.
  4. Emission Probabilities (B):

    • These describe the probability of observing a particular symbol given the current state.
    • The emission probability b_i(o) is defined as: b_i(o) = P(o_t = o \,|\, s_t = s_i)
    • This is the probability of observing the symbol o at time t given that the system is in state s_i.
    • The emission probabilities are typically stored in a matrix B of size N \times M, where M is the number of possible observation symbols.
  5. Initial State Probabilities (\pi):

    • These describe the initial probability distribution of the system’s states.
    • The initial state probability \pi_i represents the probability of the system starting in state s_i.
    • The initial state probabilities are represented by the vector \pi = [\pi_1, \pi_2, \dots, \pi_N], where \sum_{i=1}^{N} \pi_i = 1.

Thus, the HMM parameters are represented as \lambda = (A, B, \pi).
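
To make these components concrete, here is a minimal Python/NumPy sketch of a toy two-state HMM; the state names, observation symbols, and probability values are invented purely for illustration.

```python
import numpy as np

# Toy HMM with N = 2 hidden states and M = 3 observation symbols.
# All names and numbers below are illustrative, not taken from any dataset.
states = ["Rainy", "Sunny"]            # hidden states S
symbols = ["walk", "shop", "clean"]    # observation alphabet

pi = np.array([0.6, 0.4])              # initial state probabilities (sum to 1)

A = np.array([[0.7, 0.3],              # transition matrix A, N x N (rows sum to 1)
              [0.4, 0.6]])

B = np.array([[0.1, 0.4, 0.5],         # emission matrix B, N x M (rows sum to 1)
              [0.6, 0.3, 0.1]])

# The full parameter set is lambda = (A, B, pi).
```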


3. The Core Problems in HMMs

The Three Fundamental Problems in HMMs

Given an HMM model \lambda = (A, B, \pi) and an observation sequence O = (o_1, o_2, \dots, o_T), there are three key problems we need to solve:


1. Evaluation Problem (Forward Probability)

The evaluation problem involves computing the probability of the observed sequence O given the model \lambda. This helps us understand how likely it is that the sequence O could have been generated by the HMM.

Task: Compute P(O \,|\, \lambda).

Forward Algorithm:

To solve the evaluation problem, we use the Forward Algorithm. It calculates the likelihood of the observation sequence by recursively computing the forward probabilities.

Let \alpha_t(i) be the probability of observing the partial sequence o_1, o_2, \dots, o_t and ending in state s_i at time t. Mathematically:

\alpha_t(i) = P(o_1, o_2, \dots, o_t, s_t = s_i \,|\, \lambda)

Algorithm:
  1. Initialization:

    \alpha_1(i) = \pi_i b_i(o_1)

    Where b_i(o_1) is the emission probability of o_1 given state s_i, and \pi_i is the initial probability of state s_i.

  2. Recursion:

    \alpha_t(i) = \left[ \sum_{j=1}^{N} \alpha_{t-1}(j) a_{ji} \right] b_i(o_t)

    This equation recursively calculates, for t = 2, \dots, T, the probability of observing the sequence up to time t and ending in state s_i at time t.

  3. Termination:

    P(O \,|\, \lambda) = \sum_{i=1}^{N} \alpha_T(i)

    The final probability of the entire observation sequence is the sum of the forward probabilities at the last time step.
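
The three steps above translate almost directly into code. Below is a minimal NumPy sketch of the Forward Algorithm, assuming the toy A, B, and pi defined earlier and an observation sequence encoded as integer indices into the columns of B.

```python
import numpy as np

def forward(obs, A, B, pi):
    """Return P(O | lambda) and the full N x T table of forward probabilities."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                       # 1. initialization
    for t in range(1, T):                                 # 2. recursion
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha[:, -1].sum(), alpha                      # 3. termination

# Example: likelihood of observing "walk, shop, clean" -> indices [0, 1, 2]
# p_obs, alpha = forward([0, 1, 2], A, B, pi)
```

In practice, the product of many small probabilities underflows for long sequences, so implementations usually scale the columns of alpha or work in log space.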


2. Decoding Problem (Viterbi Algorithm)

The decoding problem involves determining the most likely sequence of hidden states that produced the observed sequence.

Task: Find the sequence S = (s_1, s_2, \dots, s_T) that maximizes the probability P(S \,|\, O, \lambda).

Viterbi Algorithm:

The Viterbi Algorithm solves this problem by using dynamic programming to find the optimal state sequence.

Let \delta_t(i) represent the probability of the most likely path to state s_i at time t, given the observations o_1, o_2, \dots, o_t.

Algorithm:
  1. Initialization:

    \delta_1(i) = \pi_i b_i(o_1)

    Where \pi_i is the initial state probability, and b_i(o_1) is the emission probability of o_1.

  2. Recursion:

    \delta_t(i) = \max_{j=1}^{N} \left( \delta_{t-1}(j) a_{ji} \right) b_i(o_t)

    This equation finds the most likely path to state s_i at time t, considering all possible previous states.

  3. Termination:

    P_{\text{max}} = \max_{i} \delta_T(i)

    The final probability is the maximum value among all the possible ending states.

  4. Backtracking: To reconstruct the state sequence, we backtrack through the argmax decisions stored at each time step.
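
Continuing the same sketch (same toy parameters and integer-encoded observations as above), the Viterbi recursion stores an argmax back-pointer at each step so the best path can be recovered in step 4.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the most likely hidden-state sequence and its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((N, T))
    psi = np.zeros((N, T), dtype=int)          # back-pointers for backtracking
    delta[:, 0] = pi * B[:, obs[0]]            # 1. initialization
    for t in range(1, T):                      # 2. recursion
        scores = delta[:, t - 1, None] * A     # scores[j, i] = delta_{t-1}(j) * a_{ji}
        psi[:, t] = scores.argmax(axis=0)
        delta[:, t] = scores.max(axis=0) * B[:, obs[t]]
    best_last = int(delta[:, -1].argmax())     # 3. termination
    path = [best_last]
    for t in range(T - 1, 0, -1):              # 4. backtracking
        path.append(int(psi[path[-1], t]))
    return path[::-1], delta[:, -1].max()
```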


3. Learning Problem (Baum-Welch Algorithm)

In this problem, we want to learn the parameters A, B, and \pi of the HMM from a set of observed sequences. This is done by maximizing the likelihood of the observed data.

Task: Find the parameters \lambda = (A, B, \pi) that maximize P(O \,|\, \lambda).

Baum-Welch Algorithm:

The Baum-Welch Algorithm is an implementation of the Expectation-Maximization (EM) algorithm, which iteratively estimates the parameters.

Algorithm:

  1. E-step:
    • Calculate the forward probabilities \alpha_t(i) and backward probabilities \beta_t(i).
    • Compute the responsibilities (expected counts) of state occupations and transitions: \gamma_t(i) = P(s_t = s_i \,|\, O, \lambda) and \xi_t(i,j) = P(s_t = s_i, s_{t+1} = s_j \,|\, O, \lambda).
  2. M-step:
    • Re-estimate A, B, and \pi from these expected counts.

Both steps are detailed below.

Expectation Step (E-Step)

In this step, we compute the expected values of the state occupation probabilities (denoted as \gamma_t(i)) and the state transition probabilities (denoted as \xi_t(i,j)) using the forward and backward algorithms.

  1. Forward Probability \alpha_t(i): This represents the probability of the partial sequence o_1, o_2, \dots, o_t and being in state s_i at time t, given the model parameters.

  2. Backward Probability \beta_t(i): This represents the probability of the partial sequence o_{t+1}, o_{t+2}, \dots, o_T, given that the system is in state s_i at time t and the model parameters, i.e., \beta_t(i) = P(o_{t+1}, o_{t+2}, \dots, o_T \,|\, s_t = s_i, \lambda).

Using these two quantities, we calculate:

  • State Occupation Probability \gamma_t(i):

    \gamma_t(i) = P(s_t = s_i \,|\, O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O \,|\, \lambda)}

    This is the probability that the system is in state s_i at time t, given the observed sequence O.

  • State Transition Probability \xi_t(i,j):

    \xi_t(i,j) = P(s_t = s_i, s_{t+1} = s_j \,|\, O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(O \,|\, \lambda)}

    This represents the probability that the system is in state s_i at time t and in state s_j at time t+1, given the observed sequence.
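
As a sketch of how these quantities are computed in code (reusing the forward() function from the evaluation section, and assuming a single observation sequence), the backward pass and the E-step look roughly like this:

```python
import numpy as np

def backward(obs, A, B):
    """N x T table of backward probabilities beta_t(i), with beta_T(i) = 1."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((N, T))
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_{ij} * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[:, t] = A @ (B[:, obs[t + 1]] * beta[:, t + 1])
    return beta

def e_step(obs, A, B, pi):
    """Compute gamma[i, t] and xi[i, j, t] for one observation sequence."""
    p_obs, alpha = forward(obs, A, B, pi)     # forward() as sketched earlier
    beta = backward(obs, A, B)
    gamma = alpha * beta / p_obs
    N, T = A.shape[0], len(obs)
    xi = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        xi[:, :, t] = (alpha[:, t, None] * A
                       * B[:, obs[t + 1]] * beta[:, t + 1]) / p_obs
    return gamma, xi
```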

Maximization Step (M-Step)

In this step, the model parameters are updated based on the expected counts computed in the E-step. The goal is to re-estimate the HMM parameters A, B, and \pi such that they maximize the likelihood of the observed data.

  1. Update the transition probabilities:

    a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

    This gives the probability of transitioning from state s_i to state s_j by considering the expected number of transitions from state s_i to state s_j, normalized by the expected number of times the system was in state s_i.

  2. Update the emission probabilities:

    b_i(o) = \frac{\sum_{t=1}^{T} \gamma_t(i) \mathbb{I}(o_t = o)}{\sum_{t=1}^{T} \gamma_t(i)}

    This represents the probability of observing symbol o in state s_i. \mathbb{I}(o_t = o) is an indicator function, which is 1 if the observation at time t is o, and 0 otherwise. The numerator is the expected number of times o was observed while in state s_i, normalized by the expected number of times the system was in state s_i.

  3. Update the initial state probabilities:

    \pi_i = \gamma_1(i)

    The initial state probabilities are updated to the expected proportion of the system being in state s_i at the first time step.
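
The re-estimation formulas above map almost line-for-line onto the gamma and xi arrays produced by the E-step sketch. The snippet below again assumes a single observation sequence; with multiple sequences, the expected counts are summed across sequences before normalizing.

```python
import numpy as np

def m_step(obs, gamma, xi, M):
    """Re-estimate (A, B, pi) from the expected counts of one sequence."""
    obs = np.asarray(obs)
    pi_new = gamma[:, 0]                                              # pi_i = gamma_1(i)
    A_new = xi.sum(axis=2) / gamma[:, :-1].sum(axis=1, keepdims=True)
    B_new = np.zeros((gamma.shape[0], M))
    for k in range(M):                                                # indicator I(o_t = k)
        B_new[:, k] = gamma[:, obs == k].sum(axis=1)
    B_new /= gamma.sum(axis=1, keepdims=True)
    return A_new, B_new, pi_new

# One Baum-Welch iteration:
# gamma, xi = e_step(obs, A, B, pi)
# A, B, pi = m_step(obs, gamma, xi, B.shape[1])
```

Iterating the E-step and M-step until P(O \,|\, \lambda) stops improving converges to a local maximum of the likelihood.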


4. Applications of HMM

Hidden Markov Models have broad applications in a variety of fields, especially where time-sequenced data is prevalent. Here are some key applications:

1. Speech Recognition

In speech recognition, the sequence of sounds or phonemes spoken is modeled as a sequence of hidden states, and the observed data are the acoustic features (e.g., Mel-frequency cepstral coefficients - MFCCs) extracted from the sound wave. HMMs are used to decode the most likely sequence of phonemes or words that correspond to the speech signal.

  • States: Phonemes or words.
  • Observations: Acoustic features (e.g., frequency components).
  • Transition Probabilities: The probability of transitioning from one phoneme to another (e.g., from "s" to "p").
  • Emission Probabilities: The probability of observing a specific feature vector given a particular phoneme.

2. Part-of-Speech Tagging

In Natural Language Processing (NLP), HMMs are widely used for part-of-speech (POS) tagging, where the goal is to assign each word in a sentence to a part-of-speech category (e.g., noun, verb, adjective).

  • States: POS tags (e.g., NN, VB, JJ).
  • Observations: Words in the sentence.
  • Transition Probabilities: The likelihood of transitioning from one POS tag to another (e.g., from NN to VB).
  • Emission Probabilities: The probability of a particular word being generated from a POS tag (e.g., "run" being a verb or noun).

3. Bioinformatics

In bioinformatics, HMMs are used to analyze biological sequences, such as DNA or protein sequences. The sequences contain hidden biological states (e.g., exons, introns, or genes) that affect observable features (e.g., nucleotide or amino acid sequences).

  • States: Biological states such as exons, introns, or different functional regions in genes.
  • Observations: DNA sequences (e.g., nucleotide bases A, C, G, T).
  • Transition Probabilities: The probability of transitioning between different biological regions (e.g., from exon to intron).
  • Emission Probabilities: The likelihood of observing a specific nucleotide at a given position in a gene.

4. Financial Modeling

HMMs can be used to model financial markets, where the hidden states represent different market conditions (e.g., bullish, bearish, or stable), and the observed data are stock prices, returns, or volatility.

  • States: Market regimes (e.g., bullish, bearish).
  • Observations: Stock prices or returns.
  • Transition Probabilities: The likelihood of moving from one market regime to another.
  • Emission Probabilities: The probability distribution of returns given a specific market regime.

5. Limitations of HMMs

While HMMs are powerful and widely used, they also have certain limitations:

  1. Markov Assumption: HMMs assume that the process is a first-order Markov process, meaning that the current state depends only on the previous state, not the entire history. This assumption may not always hold in real-world systems.

  2. Fixed Number of States: HMMs require the number of states to be predefined, and the structure of the hidden states may not be well-known in advance, leading to potential difficulties in model selection.

  3. Gaussian Assumption (in Continuous HMMs): In continuous HMMs, it is common to model the emission probabilities using Gaussian distributions. However, not all types of data follow a Gaussian distribution, which could affect the model's performance.

  4. Parameter Estimation: HMMs can suffer from issues like overfitting when there is not enough training data, especially if the number of parameters in the model (i.e., the number of hidden states) is large.

  5. Computational Complexity: Although the algorithms (such as the forward, Viterbi, and Baum-Welch algorithms) are efficient, they still require significant computational resources, especially for large state spaces or long sequences.


6. Conclusion

Hidden Markov Models (HMMs) are an invaluable tool for modeling sequential data, where there is an underlying structure (hidden states) that drives observable outcomes. With their applications ranging from speech recognition to bioinformatics and financial modeling, HMMs have proven to be highly versatile and widely applicable.

Despite their power, HMMs come with challenges related to model assumptions (like the Markov property), parameter estimation, and computational cost. However, with the right understanding of these models and the use of algorithms like Viterbi, Forward, and Baum-Welch, HMMs provide an effective framework for solving a variety of sequential prediction problems.