Approximate inference methods make it possible to learn realistic models from big data by trading off computation time for accuracy, when exact learning and inference are computationally intractable.

# Approximate Inference

• In latent variable model, we might want to extract features $E[h \mid v]$ describing the observed variables $v$ (visible variable $v$, a set of latent variables $h$).
• We often training our models using the principle of maximum likelihood.
• Because we often want to compute $p(h \mid v)$ in order to implement a learning rule.
• This is inference problem in which we must predict the value of some variables given other variables, or predict the probability distribution over some variables given the value of other variables.
• Unfortunately, inference problems are intractable, because of computing $p(h \mid v)$ or taking expectations with respect to it is difficult which are often necessary for tasks like maximum likelihood learning.
• Also, computing the marginal probability of a general graphical model is $\# P$ hard which is a generalization of the complexity class $NP$.
• So, exact inference requires an exponential amount of time in these models.

# Inference as Optimization

• Decomposition of Log-Likelihood
• Approximate inference algorithms may be derived by approximating the underlying optimization problem.
• To construct the optimization problem, we would like to compute the log probability of the observed data, $\log p(v; \theta)$.
• Sometimes, it is difficult to compute $\log p(v;\theta)$ if it is costly to marginalize out $h$.
• Instead, we can compute a lower bound $\mathcal{L}(v,\theta,q)$ on $\log p(v;\theta)$
• This bound is called the evidence lower bound (ELBO) or the negative variational free energy.
• Where $q$ is an arbitrary probability distribution over $h$, the ELBO is defined to be: • Because the difference between $\log p(v)$ and $\mathcal{L}(v,\theta,q)$ is given by the KL divergence and because the KL divergence is always non-negative, we can see that $\mathcal{L}$ always has at most the same value as the desired log probability.
• The two are equal if and only if $q$ is the same distribution as $p(h \mid v)$.
• $\mathcal{L}$ can be considerably easier to compute for some distributions q. which leads the more canonical definition of the evidence lower bound,

• For an appropriate choice of $q$, $\mathcal{L}$ is tractable to compute.
• For any choice of $q$, $\mathcal{L}$ provides a lower bound on the likelihood.
• For $q(h \mid v)$ that are better approximations of $p(h \mid v)$, the lower bound $\mathcal{L}$ will be tighter which is closer to $\log p(v)$.
• When $q(h \mid v) = p(h \mid v)$, the approximation is perfect, and $\mathcal{L}(v,\theta,q)=\log p(v;\theta)$.
• We can think of inference as the procedure for finding the $q$ that maximizes $\mathcal{L}$.
• Exact inference maximizes $\mathcal{L}$ perfectly by searching over a family of functions $q$ that includes $p(h \mid v)$.
• Below techniques are the way to derive different forms of approximate inference by using approximate optimization to find $q$.
• No matter what choice of $q$ we use, $\mathcal{L}$ is as lower bound.
• We can get tighter or looser bounds that are cheaper or more expensive to compute depending on how we choose to approach this optimization problem.

# Expectation Maximization

• Expectation Maximization (EM) is an algorithm for maximizing a lower bound $\mathcal{L}$ which is popular training algorithm for models with latent variables.
• EM is not an approach to approximate inference, but rather an approach to learning with an approximate posterior.
• EM is widely used algorithm for MLE and MAP problem of probabilistic model with latent variable.
• The EM algorithm consists of alternating between two steps until convergence:
1. E-step (Expectation step)
• We maximize $\mathcal{L}$ with respect to $q$.
• Let $\theta^0$ denote that value of the parameters at the beginning of the step.
• Set $q(h^i \mid v) = p(h^i \mid v^i; \theta^0)$ for all indices $i$ of the training examples $v^i$ we want to train on.
• By this we mean $q$ is defined in terms of the current parameter value of $\theta^0$; if we vary $\theta$ then $p(h \mid v;\theta)$ will change but $q(h \mid v)$ will remain equal to $p(h \mid v;\theta^0)$.
• E.g. In K-means clustering, E-step assigns the data points to the nearest centroid. 1. M-step (Maximization step)
• We maximize $\mathcal{L}$ with respect to $\theta$.
• Completely or partially maximize with respect to $\theta$ using your optimization algorithm of choice.
• E.g. In K-means clustering, M-step update the centroid positions given the assignments.  # Variational Inference

• Variational inference is that we approximate the true distribution $p(h \mid v)$ by seeking an approximate distribution $q(h \mid v)$ that is as close to the true one as possible.