Expectation Maximization (EM)
EM For Naive Bayes Model
Suppose we have a class variable \(C\) and discrete evidence variables \(X_1,\dots,X_n\).
Given a dataset \(\mathcal{D} = \{ \mathbf{x}^{(1)}, \dots, \mathbf{x}^{(M)}\}\), where \(\mathbf{x}^{(j)}\) represents the \(j\)'th observed assignment to the evidence variables \(X_1,\dots,X_n\) and the class variable \(C\) is never observed, we wish to estimate the model parameters \(\theta\) with EM.
If we initialize all parameters uniformly, EM converges in a single iteration, and we can derive a closed-form expression for the parameter values at convergence:
E-Step
For each training example \(\mathbf{x}^{(j)}\), we compute, for every value \(c\) that \(C\) can take on, \begin{equation} Q^{(t+1)}(C=c \mid \mathbf{x}^{(j)}) = P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) \end{equation} using our current parameter guess \(\theta^{(t)}\). By the definition of conditional probability, \begin{equation} P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) = \frac{P(C=c, \mathbf{x}^{(j)} \mid \theta^{(t)})}{P(\mathbf{x}^{(j)} \mid \theta^{(t)})}
\end{equation}
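To make the computation concrete, here is a minimal NumPy sketch of this E-step for a single example (not part of the derivation itself; the names `e_step_posterior`, `prior`, and `cpds` are illustrative, and the CPDs are assumed to be stored as integer-indexed tables):

```python
import numpy as np

def e_step_posterior(x, prior, cpds):
    """Q(C = c | x) for one example under a Naive Bayes model.

    prior : array of shape (|Val(C)|,) holding P(C = c).
    cpds  : list of n arrays with cpds[i][c, v] = P(X_i = v | C = c).
    x     : length-n vector of integer-coded feature values.
    """
    # Unnormalized joint: P(C = c, x) = P(C = c) * prod_i P(x_i | C = c)
    joint = prior.copy()
    for i, xi in enumerate(x):
        joint *= cpds[i][:, xi]
    # Divide by P(x) = sum_c P(C = c, x) to obtain the posterior
    return joint / joint.sum()
```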
Our Bayesian network defined by the “Naive Bayes” assumption has \(n+1\) nodes: one class variable \(C\) and \(n\) feature variables \(X_1,\dots,X_n\). Let \(i\) denote the feature index.
Substituting in the definition of the joint probability distribution as a product of CPD tables:
\[P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) = \frac{ P(C=c)\prod\limits_{i=1}^{n} P(x_i^{(j)} \mid C=c)}{ \sum\limits_{c'} P(C=c') \prod\limits_{i=1}^{n} P(x_i^{(j)} \mid C=c') }\]We initialize the class prior uniformly, \(P(C=c) = \theta_{c}^{(0)} = \frac{1}{|\text{Val}(C)|}\). We say the \(i\)'th node \(X_i\) takes on value \(x_i\) according to the parameter \(P(X_i = x_i \mid C=c) = \theta_{x_i \mid c}\), which we also initialize uniformly: \(\theta_{x_i \mid c}^{(0)} = \frac{1}{|\text{Val}(X_i)|}\). Substituting these uniform values, the expression simplifies to \begin{equation} P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) = \frac{\frac{1}{|\text{Val}(C)|} \prod\limits_{i=1}^{n} \frac{1}{ |\text{Val}(X_i)|}}{\frac{1}{|\text{Val}(C)|} \sum\limits_{c'} \prod\limits_{i=1}^{n} \frac{1}{ |\text{Val}(X_i)|}}
\end{equation} The \(\frac{1}{|\text{Val}(C)|}\) factor cancels in the numerator and denominator: \begin{equation} P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) = \frac{ \prod\limits_{i=1}^{n} \frac{1}{ |\text{Val}(X_i)|}}{ \sum\limits_{c'} \prod\limits_{i=1}^{n} \frac{1}{ |\text{Val}(X_i)|}}
\end{equation} The product inside the sum does not depend on \(c'\), so the sum over \(c'\) simply contributes a factor of \(|\text{Val}(C)|\): \begin{equation} P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) = \frac{ \prod\limits_{i=1}^{n} \frac{1}{ |\text{Val}(X_i)|}}{ |\text{Val}(C)| \prod\limits_{i=1}^{n} \frac{1}{ |\text{Val}(X_i)|}} \end{equation}
This fraction can finally be simplified to \begin{equation} P(C=c \mid \mathbf{x}^{(j)}, \theta^{(t)} ) = \frac{1}{ |\text{Val}(C)|} \end{equation}
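As a quick numerical check of this result (continuing the illustrative sketch above, with arbitrarily assumed toy cardinalities), any example yields the same posterior \(\frac{1}{|\text{Val}(C)|}\):

```python
# Uniform initialization: 3 classes; features with 2, 4, and 5 values (toy choices).
n_classes, cards = 3, [2, 4, 5]
prior = np.full(n_classes, 1.0 / n_classes)                # theta_c^(0)
cpds = [np.full((n_classes, k), 1.0 / k) for k in cards]   # theta_{x_i | c}^(0)

# Any observation yields the same uniform posterior [1/3, 1/3, 1/3].
print(e_step_posterior(np.array([1, 3, 0]), prior, cpds))
```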
M-Step
\begin{equation} \theta^{(t+1)} = \mbox{arg }\underset{\theta}{\mbox{max }}\sum\limits_{j=1}^{M} \sum\limits_{c} Q^{(t+1)}(C=c \mid \mathbf{x}^{(j)}) \log P(\mathbf{x}^{(j)}, C=c \mid \theta) \end{equation}
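For tabular CPDs, the maximizer of this objective is the familiar normalized-soft-count update, which is derived explicitly below. As a computational illustration (continuing the sketch above; `m_step`, `X`, and `Q` are illustrative names, and the data are assumed integer-coded):

```python
def m_step(X, Q, cards):
    """Weighted-MLE M-step: re-estimate the prior and CPDs from soft counts.

    X     : (M, n) integer-coded data matrix.
    Q     : (M, |Val(C)|) matrix of E-step posteriors Q(C = c | x^(j)).
    cards : list of feature cardinalities |Val(X_i)|.
    """
    M = X.shape[0]
    n_classes = Q.shape[1]
    new_prior = Q.sum(axis=0) / M                      # expected class counts / M
    new_cpds = []
    for i, k in enumerate(cards):
        counts = np.zeros((n_classes, k))
        for v in range(k):
            # Expected count of (X_i = v, C = c): sum_j Q(c | x^(j)) * 1{x_i^(j) = v}
            counts[:, v] = Q[X[:, i] == v].sum(axis=0)
        new_cpds.append(counts / counts.sum(axis=1, keepdims=True))
    return new_prior, new_cpds
```

With the uniform posteriors computed in the E-step, every entry of `Q` is \(\frac{1}{|\text{Val}(C)|}\), so these soft counts reduce to plain counts; this is exactly the cancellation carried out next.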
Maximizing the objective above gives us a weighted MLE formulation, using the weights \begin{equation} Q^{(t+1)}(C=\hat{c} \mid \mathbf{x}^{(j)}) = p_{\theta}(C=\hat{c} \mid \mathbf{x}^{(j)}) \end{equation} for the tuples of (observed data, hidden assignment), here given by \((\mathbf{x}^{(j)}, \hat{c})\):

\begin{equation} \theta_{x_i \mid C}^{(t+1)} = \frac{ \sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } X_i = x_i, C=\hat{c} \} \, p_{\theta}(C=\hat{c} \mid \mathbf{x}^{(j)}) }{\sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } C=\hat{c} \} \, p_{\theta}(C=\hat{c} \mid \mathbf{x}^{(j)}) } \end{equation}

Plugging in our computed values for \(p_{\theta}(C=\hat{c} \mid \mathbf{x}^{(j)})\), we see \begin{equation} \theta_{x_i \mid C}^{(t+1)} = \frac{ \sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } X_i = x_i, C=\hat{c} \} \frac{1}{ |\text{Val}(C)|} }{\sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } C=\hat{c} \} \frac{1}{ |\text{Val}(C)|} } \end{equation}

The weight \(\frac{1}{|\text{Val}(C)|}\) does not depend on \(\hat{c}\) or \(j\), so it can be pulled out of both sums: \begin{equation} \theta_{x_i \mid C}^{(t+1)} = \frac{ \frac{1}{ |\text{Val}(C)|} \sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } X_i = x_i, C=\hat{c} \} }{ \frac{1}{ |\text{Val}(C)|} \sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } C=\hat{c} \} } \end{equation}

and it now cancels in the numerator and denominator: \begin{equation} \theta_{x_i \mid C}^{(t+1)} = \frac{ \sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } X_i = x_i, C=\hat{c} \} }{ \sum\limits_{j=1}^{M} \sum\limits_{\hat{c}} \mathbb{1} \{ (\mathbf{x}^{(j)}, \hat{c}) \mbox{ consistent w. } C=\hat{c} \} } \end{equation}

We see that we've reduced the expression to the unweighted MLE formulation: