**Approximate inference** methods make it possible to learn realistic models from big data by trading computation time for accuracy when exact learning and inference are computationally intractable.

# Approximate Inference

- In a latent variable model, we might want to extract features describing the observed variables (a visible variable $v$ and a set of latent variables $h$).
- We often train our models using the principle of maximum likelihood.
- In doing so, we often want to compute $p(h \mid v)$ in order to implement a learning rule.
- This is an **inference problem**, in which we must predict the value of some variables given other variables, or predict the probability distribution over some variables given the value of other variables.

- Unfortunately, these inference problems are often intractable: computing $p(h \mid v)$, or taking expectations with respect to it, is difficult, yet such expectations are often necessary for tasks like maximum likelihood learning.
- Moreover, computing the marginal probability of a general graphical model is #P-hard, where #P is a generalization of the complexity class NP.
- So exact inference requires an exponential amount of time in these models.
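The exponential blow-up can be seen directly by brute-force marginalization. The sketch below uses a toy pairwise binary model that is purely an illustrative assumption (not from the text): summing out $n$ binary latent variables requires $2^n$ terms.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def unnorm_joint(v, h, W):
    # Unnormalized p(v, h) for a toy pairwise binary model (illustrative only).
    return np.exp(v @ W @ h)

n_visible = 3
v = np.array([1.0, 0.0, 1.0])

# Exact marginalization sums over every configuration of h:
# the number of terms doubles with each additional latent variable.
for n_hidden in (4, 8, 12):
    W = rng.normal(size=(n_visible, n_hidden))
    configs = list(itertools.product([0.0, 1.0], repeat=n_hidden))
    pv = sum(unnorm_joint(v, np.array(h), W) for h in configs)
    print(n_hidden, len(configs))  # 4 -> 16, 8 -> 256, 12 -> 4096 terms
```

With even a few hundred latent variables, this sum becomes astronomically large, which is why approximate methods are needed.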

# Inference as Optimization

**Decomposition of Log-Likelihood**

- Approximate inference algorithms may be derived by approximating the underlying optimization problem.
- To construct the optimization problem, we would like to compute the log probability of the observed data, $\log p(v; \theta)$.
- Sometimes it is difficult to compute $\log p(v; \theta)$ if it is costly to marginalize out $h$.
- Instead, we can compute a lower bound $\mathcal{L}(v, \theta, q)$ on $\log p(v; \theta)$.
- This bound is called the *evidence lower bound (ELBO)* or the negative *variational free energy*.
- Where $q$ is an arbitrary probability distribution over $h$, the ELBO is defined to be:

$$\mathcal{L}(v, \theta, q) = \log p(v; \theta) - D_{\mathrm{KL}}\left(q(h \mid v) \,\|\, p(h \mid v; \theta)\right)$$

- Because the difference between $\log p(v; \theta)$ and $\mathcal{L}(v, \theta, q)$ is given by the KL divergence, and because the KL divergence is always non-negative, we can see that $\mathcal{L}$ always has at most the same value as the desired log probability.
- The two are equal if and only if $q$ is the same distribution as $p(h \mid v)$.

- $\mathcal{L}$ can be considerably easier to compute for some distributions $q$. Rearranging the definition gives

$$\mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}\left[\log p(h, v; \theta)\right] + H(q),$$

which leads to the more canonical definition of the evidence lower bound.
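The two forms of the bound can be checked numerically. A minimal sketch, assuming a toy discrete model (the joint table and the distribution $q$ below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: unnormalized joint p(v, h) over 5 latent states for a fixed v.
joint = rng.random(5)
pv = joint.sum()                  # p(v) by exact marginalization
posterior = joint / pv            # p(h | v)

q = rng.random(5)
q /= q.sum()                      # an arbitrary distribution over h

# Form 1: log p(v) - KL(q || p(h|v)).
kl = np.sum(q * np.log(q / posterior))
elbo1 = np.log(pv) - kl

# Form 2 (canonical): E_{h~q}[log p(h, v)] + H(q).
elbo2 = np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

print(elbo1, elbo2)  # the two forms agree
```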

- For an appropriate choice of $q$, $\mathcal{L}$ is tractable to compute.
- For any choice of $q$, $\mathcal{L}$ provides a lower bound on the likelihood.
- For choices of $q(h \mid v)$ that are better approximations of $p(h \mid v)$, the lower bound will be tighter, meaning closer to $\log p(v)$.
- When $q(h \mid v) = p(h \mid v)$, the approximation is perfect, and $\mathcal{L}(v, \theta, q) = \log p(v; \theta)$.

- We can thus think of inference as the procedure for finding the $q$ that maximizes $\mathcal{L}$.
- Exact inference maximizes $\mathcal{L}$ perfectly by searching over a family of functions $q$ that includes $p(h \mid v)$.
- The techniques below derive different forms of approximate inference by using approximate optimization to find $q$.
- No matter what choice of $q$ we use, $\mathcal{L}$ is a lower bound.
- We can get tighter or looser bounds that are cheaper or more expensive to compute, depending on how we choose to approach this optimization problem.
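These bound properties can be demonstrated on a small discrete model where the exact posterior is available by enumeration (the model below is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: unnormalized joint p(v, h) over 4 latent states for a fixed v.
joint = rng.random(4)
log_pv = np.log(joint.sum())      # the quantity we want to bound
posterior = joint / joint.sum()   # exact p(h | v)

def elbo(q):
    # L(v, theta, q) = E_{h~q}[log p(h, v)] + H(q)
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

uniform = np.full(4, 0.25)                   # a crude but cheap choice of q
print(elbo(uniform) <= log_pv)               # any q gives a lower bound
print(np.isclose(elbo(posterior), log_pv))   # the bound is tight at q = p(h|v)
```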

# Expectation Maximization

**Expectation Maximization (EM)** is an algorithm for maximizing the lower bound $\mathcal{L}$, and a popular training algorithm for models with latent variables.

- EM is not an approach to approximate inference, but rather an approach to learning with an approximate posterior.
- EM is a widely used algorithm for MLE and MAP problems in probabilistic models with latent variables.

- The EM algorithm consists of alternating between two steps until convergence:

**E-step (Expectation step)**

- We maximize $\mathcal{L}$ with respect to $q$.
- Let $\theta^{(0)}$ denote the value of the parameters at the beginning of the step.
- Set $q(h^{(i)} \mid v) = p(h^{(i)} \mid v^{(i)}; \theta^{(0)})$ for all indices $i$ of the training examples $v^{(i)}$ we want to train on.
- By this we mean $q$ is defined in terms of the current parameter value $\theta^{(0)}$; if we vary $\theta$, then $p(h \mid v; \theta)$ will change, but $q(h \mid v)$ will remain equal to $p(h \mid v; \theta^{(0)})$.
- E.g., in K-means clustering, the E-step assigns each data point to its nearest centroid.

**M-step (Maximization step)**

- We maximize $\mathcal{L}$ with respect to $\theta$.
- Completely or partially maximize $\sum_i \mathcal{L}(v^{(i)}, \theta, q)$ with respect to $\theta$ using your optimization algorithm of choice.
- E.g., in K-means clustering, the M-step updates the centroid positions given the assignments.
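As a concrete sketch, the two alternating steps can be run on a toy one-dimensional Gaussian mixture; the data, initialization, and iteration count below are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two well-separated clusters.
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])

mu = np.array([-1.0, 1.0])    # initial component means
sigma = np.array([1.0, 1.0])  # initial standard deviations
w = np.array([0.5, 0.5])      # initial mixing weights

for _ in range(50):
    # E-step: set q(h | x) to the exact posterior responsibilities
    # under the current parameters.
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximize the bound over the parameters, holding q fixed.
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    w = nk / len(x)

print(np.sort(mu))  # the means converge near the true cluster centers
```

Hard-assigning each point to its most responsible component instead of using soft responsibilities would recover the K-means analogy from the bullets above.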

# Variational Inference

**Variational inference** approximates the true posterior $p(h \mid v)$ by seeking an approximate distribution $q(h \mid v)$ that is as close to the true one as possible, typically by maximizing $\mathcal{L}$ over a restricted family of distributions $q$.
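One common instance of this idea is mean-field coordinate ascent, where $q$ is restricted to a factorized family. A minimal sketch, assuming a toy model with two binary latent variables (the joint table is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Unnormalized log joint p(v, h1, h2) for a fixed v: a 2x2 table over binary latents.
logJ = rng.normal(size=(2, 2))

# Exact posterior, for comparison.
J = np.exp(logJ)
post = J / J.sum()

# Mean-field family: q(h1, h2) = q1(h1) q2(h2).
q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
for _ in range(100):
    # Coordinate ascent on the ELBO: q1(h1) ∝ exp(E_{q2}[log p(v, h1, h2)]),
    # and symmetrically for q2.
    q1 = np.exp(logJ @ q2)
    q1 /= q1.sum()
    q2 = np.exp(logJ.T @ q1)
    q2 /= q2.sum()

print(np.outer(q1, q2).round(3))  # factorized approximation
print(post.round(3))              # exact posterior
```

Each coordinate update can only increase the ELBO, so the iteration converges to the best factorized approximation, which remains a lower bound on $\log p(v)$.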