Gaussian Mixture Model

Probability

Gaussian Mixture Model (GMM)

This article provides an overview of the Gaussian Mixture Model. The Gaussian Mixture Model is a model used to represent probability distributions by combining multiple normal distributions. If the probability density function of a normal distribution is denoted as \(\small N\left(x \middle| \mu,\sigma^2\right) \), it can be expressed mathematically as :

\[ \small p(x)=\sum_{k=1}^K \pi_k N\left(x \middle| \mu_k, \sigma_k^2\right), \quad \sum_{k=1}^K \pi_k = 1,\quad \pi_k > 0. \]

 Arbitrary probability distributions can be represented by increasing the number of underlying normal distributions. Furthermore, since probability distributions can be smoothed through parameter estimation, this approach could be used to reduce noise or minimize data storage requirements—serving as an alternative to retaining raw sample data in the form of a non-parametric distribution (such as a histogram) by instead using an estimated Gaussian mixture model.

EM Algorithm

To estimate parameters from observation data \(\small x_1,\cdots,x_N\), one can calculate the log-likelihood and determine the parameters that maximize it by solving an optimization problem. In other words, one can estimate the parameters by designating the log-likelihood function as:

\[ \small \log L(x | \pi_k, \mu_k, \sigma^2_k,k=1,\cdots,K) =\sum_{i=1}^N \ln p(x_i)\]

and solving the optimization problem:

\[ \small \max_{\pi_k, \mu_k, \sigma^2_k} \log L(x | \pi_k, \mu_k, \sigma^2_k,k=1,\cdots,K). \]

This optimization problem is a constrained nonlinear optimization problem that is generally difficult to solve. Furthermore, since the function is typically non-convex, there is generally no algorithm capable of computing a global optimal solution with reasonable computational complexity (it corresponds to a mathematical problem classified as NP-hard).

 A practical approach would be to start from a suitable initial value and search for a local optimum. The algorithm known for efficiently searching for such a local optimum is the EM (Expectation-Maximization) algorithm. Note that this algorithm does not guarantee the discovery of a global optimum. It is an algorithm that iteratively improves the likelihood function and treats the converged solution as a local optimum. The specific computational procedure is as follows.

  1. Determine an initial point \(\small \pi_k^{(0)}, \mu_k^{(0)}, {\sigma^2_k}^{(0)} \) with suitable parameters. Set \(\small i=0 \).
  2. Using the current parameters \(\small \pi_k^{(i)}, \mu_k^{(i)}, {\sigma^2_k}^{(i)} \), the probability \(\small \gamma_{jk}^{(i)}\) that each observation \(\small x_1,\cdots,x_N\) belongs to cluster \(\small k\) is estimated using the following equation:

\[ \small \gamma_{jk}^{(i)} = \frac{\pi_k^{(i)} N\left(x_j \middle| \mu_k^{(i)}, {\sigma_k^2}^{(i)}\right)}{\sum_{l=1}^K\pi_l^{(i)} N\left(x_j \middle| \mu_l^{(i)}, {\sigma_l^2}^{(i)}\right)}\]

  1. Parameters are estimated using the probability \(\small \gamma_{jk}^{(i)}\) of belonging to the estimated cluster \(\small k\).

\[ \small \begin{align*} &\pi_k^{(i+1)} = \frac{1}{N}\sum_{j=1}^N \gamma_{jk}^{(i)} \\ &\mu_k^{(i+1)} = \frac{\sum_{j=1}^N \gamma_{jk}^{(i)}x_j}{\sum_{j=1}^N \gamma_{jk}^{(i)}} \\ &{\sigma_k^2}^{(i+1)} = \frac{\sum_{j=1}^N \gamma_{jk}^{(i)}(x_j-\mu_k^{(i+1)})^2}{\sum_{j=1}^N \gamma_{jk}^{(i)}} \end{align*}\]

  1. Set this as \(\small i=i+1\) and return to step 2. Repeat the above procedure until convergence is reached.

The procedure in step 2 is referred to as the E-Step, and the procedure in step 3 as the M-Step. When using programming languages ​​such as Python, there is no need to implement them from scratch; they can be utilized simply by calling functions from mathematical computing libraries like scikit-learn.

Information Criteria

In GMM, the number of underlying normal distributions \(\small K\) (the number of clusters) must be specified exogenously. When using the method simply as a substitute for storing a non-parametric histogram of sample data, one may assign an arbitrary numerical value; however, when estimating parameters for purposes such as smoothing a probability distribution or reducing noise, selecting the optimal number of parameters helps avoid overfitting. The criteria used to select this optimal number of parameters are known as information criteria.

 The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are commonly used examples of information criteria. If \(\small \log L\) represents the maximum value of the maximized likelihood function and \(\small p \) represents the number of parameters, then the Akaike Information Criterion is expressed as:

\[ \small \text{AIC} =-2 \log L + 2p =-2 \log L + 2 (3K-1) \]

and Bayesian Information Criterion is expressed as:

\[ \small \text{BIC} =-2 \log L + p \log N =-2 \log L + (3K-1) \log N. \]

In a GMM, there are three parameters per cluster; however, note that the number of parameters is reduced by one because the weighted average coefficients sum to 1. Therefore, \(\small p=3K-1\) holds. Calculate these values ​​for each \(\small K\); the \(\small K\) that yields the smallest value is the optimal number of clusters.

Comments