Markov Chains
Markov chains and Markov chain Monte Carlo (MCMC) algorithms are used widely in computer vision and robotics, especially for localization [1], image denoising, data association, and much more. Two popular MCMC algorithms are Metropolis-Hastings and Gibbs Sampling.
Markov Chains
Markov processes are popular probabilistic models. The simplest version of a Markov process is the discrete-time finite-state homogeneous Markov chain.
The idea is to have a directed graph, in which nodes correspond to states, and edges are labeled by the probabilities of going from one state at the current time to other states at the next time.
Every finite Markov chain has at least one stationary distribution, though it is not always unique. Under the right conditions, we can compute it.
- A Markov chain is irreducible if there is a positive probability of getting from any state to any other state in a finite number of steps. Equivalently, the directed graph associated with the chain consists of a single strongly connected component (SCC).
- A Markov chain is aperiodic if every state has period equal to one, i.e. returns to a state \(x\) are not restricted to multiples of some integer greater than one.
- A Markov chain is ergodic if it is irreducible and aperiodic; ergodicity is what guarantees a unique stationary distribution (and convergence to it).
More precisely, the Markov chain being considered is a sequence of random variables \(X_1,X_2,X_3,\cdots\), each of which takes values in \(\{1,2,\cdots,N\}\), where \(N\) is the number of states. The index of this sequence is the discrete time; at each time the chain makes a probabilistic jump, such that the probability of moving to state \(i\), conditioned on the current state being \(j\), is given by
\[\mathbb{P}(X_{k+1} = i \mid X_k = j) = p_{ij},\]where \(p_{ij}\) is the \((i,j)\)-entry of a constant matrix \(P\) called the transition matrix.
The sum of the entries in any column of \(P\) is always 1, and consequently \(P\) always has 1 as an eigenvalue. The corresponding eigenvector \(\mu\), appropriately normalized so that its entries sum to 1, is called a stationary distribution (a.k.a. invariant distribution), because if the probability of the chain currently being at each state is described by \(\mu\), i.e. \(P(X_k = i) = \mu_i\), then the chain will have the same distribution at the next time, i.e. \(P(X_{k+1} = i) = \mu_i\), and at all future times.
In addition, it can be shown that, if the span of all eigenvectors associated with eigenvalues of absolute value 1 is one-dimensional (known as a sufficient condition for ergodicity), then no matter what the initial distribution is, the chain will always converge to the stationary distribution, which is also unique. Therefore, \(\mu\) is important to find, and in certain situations, we can compute it.
By definition, \((P - I)\mu = 0\). In addition, the kernel of \(P - I\) is one-dimensional under the sufficient condition for ergodicity. Therefore, one can take an arbitrary row of \(P - I\) and replace it by all ones. Call the new matrix \(K\). Then \(K\mu = b\), where \(b\) is a vector with all zero entries except for the row corresponding to the replaced row of \(K\); at that row, \(b\)'s entry is 1, so that it encodes the constraint \(\sum\limits_i \mu_i = 1\). This way, the task of finding the invariant distribution \(\mu\) is converted into solving a linear system.
This approach is general.
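As a concrete sketch of this replace-a-row approach (the 3-state matrix below is just an example, written in the column-stochastic convention used above):

```python
import numpy as np

# Example 3-state column-stochastic transition matrix: each column sums to 1.
P = np.array([[0.3, 0.4, 0.5],
              [0.3, 0.4, 0.3],
              [0.4, 0.2, 0.2]])
N = P.shape[0]

# Replace the last row of (P - I) with all ones, and set b to zero except
# for a 1 in that row, which encodes the constraint sum_i mu_i = 1.
K = P - np.eye(N)
K[-1, :] = 1.0
b = np.zeros(N)
b[-1] = 1.0

mu = np.linalg.solve(K, b)
print(mu)  # approx. [0.3889, 0.3333, 0.2778], the stationary distribution
```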
Simple Markov Chain Example: Clock States
Consider a circle of \(N\) states that are clockwise ordered (i.e., if \(N = 12\), the states correspond to the ticks on a standard clock). Conditioned on the current state being \(i\), the chain can jump to four neighboring states: \(i-2\), \(i-1\), \(i\), and \(i+1\), in the sense of modulo arithmetic (e.g., if \(i = 1\), then the states \(i-1\) and \(i-2\) are actually the states \(N\) and \(N-1\), and if \(i = N\), then the state \(i+1\) is actually the state 1). The transition probabilities are given by
\[\begin{cases} \mathbb{P}(X_{k+1} = i-2 \mid X_k = i) = \cos(\phi_i)^2 \cos(\phi_i)^2 \cos(\psi_i)^2 \\ \mathbb{P}(X_{k+1} = i-1 \mid X_k = i) = \cos(\phi_i)^2 \sin(\phi_i)^2 \\ \mathbb{P}(X_{k+1} = i \mid X_k = i) = \sin(\phi_i)^2 \\ \mathbb{P}(X_{k+1} = i+1 \mid X_k = i) = \cos(\phi_i)^2 \cos(\phi_i)^2 \sin(\psi_i)^2 \end{cases}\]where \(\phi_i = i \pi/(N + 1)\) and \(\psi_i = 2i \pi/(N + 1)\).
An iterative linear system solver can find the invariant distribution of this chain.
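To experiment with this chain numerically, we first need its transition matrix. Below is a minimal sketch of its construction in NumPy, assuming 1-indexed states in the formulas above and the column-stochastic convention \(p_{ij} = \mathbb{P}(X_{k+1} = i \mid X_k = j)\) (the function name is ours):

```python
import numpy as np

def clock_chain_transition_matrix(N: int) -> np.ndarray:
    """Column-stochastic transition matrix for the clock chain above:
    P[i, j] = P(X_{k+1} = i | X_k = j), with 0-indexed states here."""
    P = np.zeros((N, N))
    for i in range(N):
        # The formulas use 1-indexed states, so state i here corresponds to i+1.
        phi = (i + 1) * np.pi / (N + 1)
        psi = 2 * (i + 1) * np.pi / (N + 1)
        P[(i - 2) % N, i] = np.cos(phi)**2 * np.cos(phi)**2 * np.cos(psi)**2
        P[(i - 1) % N, i] = np.cos(phi)**2 * np.sin(phi)**2
        P[i, i] = np.sin(phi)**2
        P[(i + 1) % N, i] = np.cos(phi)**2 * np.cos(phi)**2 * np.sin(psi)**2
    return P

P = clock_chain_transition_matrix(12)
assert np.allclose(P.sum(axis=0), 1.0)  # each column is a valid distribution
```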
To see more formally why the stationary distribution of a Markov chain can be found this way, consider the eigenstructure of the transition matrix.
Let \(P\) be a column-stochastic transition matrix, so that \(P^T \in \mathbb{R}^{n \times n}\) is a matrix whose entries in any row sum to 1. By the definition of the matrix-vector product, the following must hold:
\[\begin{aligned} P^T \mathbf{1} &= \mathbf{1} \\ P^T v &= 1 \cdot v & \mbox{with } v=\mathbf{1} \\ P^T v &= \lambda v & (v \mbox{ is an eigenvector}) \end{aligned}\]\(\lambda=1\) must then be an eigenvalue of the matrix \(P^T\), and \(v=\mathbf{1}\) is a corresponding eigenvector.
Consider now the relationship between the eigenvalues of \(P^T\) and \(P\). Recall the following property of the determinant for square matrices \(A \in \mathbb{R}^{n \times n}\): \(\mbox{det}(A) = \mbox{det}(A^T)\). Consider the square matrix \((P^T-I) \in \mathbb{R}^{n \times n}\):
\[\begin{aligned} P^T-I &= P^T-I^T \\ &= (P-I)^T \\ \mbox{det}(P^T-I) &= \mbox{det}\Big((P-I)^T\Big) \\ &= \mbox{det}(P-I) & \mbox{by the determinant property above} \\ 0 = \mbox{det}(P^T-I) &= \mbox{det}(P-I) & \mbox{since } 1 \in \lambda\{P^T\} \end{aligned}\]
Note that \(P\) and \(P^T\) share the same characteristic polynomial and accordingly the same eigenvalues, meaning \(\lambda\{P^T\}=\lambda\{P\}\). Thus, \(\lambda=1\) is also an eigenvalue of \(P\), so some eigenvector \(\mu\) satisfies \(P\mu = \mu\). By rearranging terms, we note that \(\mu \in \mathrm{Null}(P - \lambda I)\):
\[\begin{aligned} P\mu &= \lambda \mu \\ P\mu - \lambda \mu &= 0 \\ (P - \lambda I) \mu &= 0 \\ \mu &\in \mathrm{Null}(P - \lambda I) \end{aligned}\]Since \(\mathrm{Null}(P - \lambda I)\) contains a nonzero vector, \((P-\lambda I)\) is a singular matrix. Thus, if \(A=(P-\lambda I)\), then \(A\mu=0\) is an underdetermined system of equations. By replacing any single equation in the system with a new equation forcing \(\mu\) to live on the probability simplex (\(\sum\limits_i \mu_i = 1\)), we can create a full-rank system of equations with a unique solution.
Solution with Direct Methods
For a concrete example, consider a 3-state chain with a row-stochastic transition matrix \(P\) (so here we work with \(P^T\)); its stationary distribution is proportional to \(w = (7, 6, 5)\), which we can recover from the null space of \(P^T - I\):

>>> import numpy as np
>>> import scipy.linalg
>>> P = np.array([[0.3,0.3,0.4],[0.4,0.4,0.2],[0.5,0.3,0.2]])
>>> P
array([[0.3, 0.3, 0.4],
       [0.4, 0.4, 0.2],
       [0.5, 0.3, 0.2]])
>>> w = np.array([7,6,5])
>>> w
array([7, 6, 5])
>>> w / w.sum()
array([0.38888889, 0.33333333, 0.27777778])
>>> 7/18
0.3888888888888889
>>> 6/18
0.3333333333333333
>>> 5/18
0.2777777777777778
>>> scipy.linalg.null_space(P.T - np.eye(3))
array([[-0.66742381],
       [-0.57207755],
       [-0.47673129]])
>>> v = scipy.linalg.null_space(P.T - np.eye(3))
>>> v *= -1
>>> v
array([[0.66742381],
       [0.57207755],
       [0.47673129]])
>>> v /= v.sum()
>>> v
array([[0.38888889],
       [0.33333333],
       [0.27777778]])
Solution with Iterative Solvers
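One simple iterative approach is power iteration: start from an arbitrary distribution and repeatedly apply the transition matrix until the distribution stops changing. For an ergodic chain, this converges to the unique stationary distribution. A minimal sketch, assuming a column-stochastic \(P\) such as the clock-chain matrix above:

```python
import numpy as np

def power_iteration_stationary(P: np.ndarray, tol: float = 1e-12,
                               max_iters: int = 100_000) -> np.ndarray:
    """Repeatedly apply a column-stochastic matrix P to an initial
    distribution until convergence (valid for ergodic chains)."""
    N = P.shape[0]
    mu = np.ones(N) / N  # arbitrary starting distribution
    for _ in range(max_iters):
        mu_next = P @ mu
        if np.linalg.norm(mu_next - mu, 1) < tol:
            return mu_next
        mu = mu_next
    return mu

# e.g., mu = power_iteration_stationary(clock_chain_transition_matrix(12))
```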
PageRank: Stochastic Matrices
MCMC
Suppose we have a high-dimensional probability distribution that we wish to sample from. MCMC is the most popular family of methods for sampling from high-dimensional distributions, which appear frequently in image processing, computer vision, robotics, etc. We generally cannot write down the transition matrix over such an enormous state space explicitly, so we resort to approximate, sampling-based methods.
- MCMC uses an adaptive proposal \(Q(x^{\prime} \mid x)\), which depends on the current sample \(x\), instead of a fixed proposal \(Q(x^{\prime})\).
- All good MCMC algorithms must satisfy ergodicity, so that the chain converges to the target distribution no matter how it is initialized.
- This adaptivity is what makes MCMC better suited than importance sampling to high dimensions: a fixed proposal rarely matches a complicated high-dimensional target, so the importance weights degenerate, with a few samples carrying nearly all of the weight.
Metropolis-Hastings
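Metropolis-Hastings (MH) constructs a Markov chain whose stationary distribution is a target density \(P(x)\) that we only need to evaluate up to a normalizing constant. At each step we propose \(x^{\ast} \sim Q(x^{\ast} \mid x)\) and accept it with probability \(\min\Big(1, \frac{P(x^{\ast})Q(x \mid x^{\ast})}{P(x)Q(x^{\ast} \mid x)}\Big)\), the same acceptance rule used in the data association example below; otherwise the chain stays at \(x\). A minimal sketch with a symmetric Gaussian random-walk proposal, for which the \(Q\) terms cancel (the target `log_p` and the step size are placeholders):

```python
import numpy as np

def metropolis_hastings(log_p, x0, num_samples, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(num_samples):
        x_star = x + step * rng.standard_normal(x.shape)   # propose
        # Accept with probability min(1, p(x*) / p(x)), computed in log space.
        if np.log(rng.uniform()) < log_p(x_star) - log_p(x):
            x = x_star
        samples.append(x.copy())
    return np.array(samples)

# Example: sample from an (unnormalized) 2-D standard normal density.
# samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), np.zeros(2), 5000)
```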
Gibbs Sampling
Gibbs sampling is an MCMC algorithm that samples each random variable of a graphical model one at a time, conditioned on the current values of all of the others. Gibbs sampling is a special case of the Metropolis-Hastings algorithm.
Image Denoising Example

Gibbs sampling has been used for image restoration since the 1980s [6]. Given an image \(X\) corrupted by noise, we want to recover the original image \(Y\).
Suppose we are working with a binary image, i.e. pixel values can only take on values -1 or 1. The image has dimensions \(N \times M\). We assume a set of (unobserved) variables \(\mathbf{y} = \{y_{ij}\}\) represents the true (unknown) image, with \(y_{ij} \in \{-1, +1\}\) indicating the value of \(x_{ij}\) before noise was added. Each (internal) \(y_{ij}\) is linked with its four immediate neighbors, \(y_{i-1,j}, y_{i+1,j}, y_{i,j-1}\), and \(y_{i,j+1}\), which together are denoted \(y_{\mathcal{N}(i,j)}\).
Conditional probability of a pixel value given its Markov blanket

We wish to compute \(p(y_{ij} = 1 \mid y_{M(i,j)})\), where \(y_{M(i,j)}\) are the variables in the Markov blanket of \(y_{ij}\). In an undirected model like an MRF, the Markov blanket of a variable simply denotes its neighbors. Thus, we wish to compute two conditional probabilities, because \(|\text{Val}(y)|=2\): \begin{equation} p(y_{ij} = \mathbf{1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \end{equation} and \begin{equation} p(y_{ij} = \mathbf{-1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \end{equation} We know \begin{equation} \begin{aligned} p(y_{ij} = \mathbf{1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \propto \phi(y_{ij},x_{ij}) \phi(y_{ij},y_{i-1,j}) \phi(y_{ij},y_{i+1,j}) \phi(y_{ij},y_{i,j-1}) \phi(y_{ij},y_{i,j+1}) \end{aligned} \end{equation} These factors are defined as
\begin{equation} \begin{aligned} p(y_{ij} = \mathbf{1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \propto \exp(\eta y_{ij} x_{ij}) \exp(\beta y_{ij} y_{i-1,j}) \exp(\beta y_{ij} y_{i+1,j}) \exp(\beta y_{ij} y_{i,j-1}) \exp(\beta y_{ij} y_{i,j+1}) \end{aligned} \end{equation}
Since \(e^{a}e^{b} = e^{a+b}\), we can collapse the product above into a single exponential of a sum. We define the set of grid neighbors as
\[\mathcal{G}(i,j) = \Big\{(i-1,j),(i+1,j),(i,j-1),(i,j+1)\Big\}\]\begin{equation} p(y_{ij} = \mathbf{1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \propto \exp\Big(\eta y_{ij} x_{ij} + \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } \beta y_{ij} y_{k,l}\Big) \end{equation} We can pull \(\beta\) out of the sum: \begin{equation} p(y_{ij} = \mathbf{1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \propto \exp\Bigg(\eta y_{ij} x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } y_{ij} y_{k,l}\Bigg) \end{equation} Since \(y\in \{-1,1\}\), by normalizing this probability, we immediately obtain a logistic function:
\begin{equation} p(y_{ij} = \mathbf{1} \mid y_{M(i,j)} ) = \frac{ \exp\Big(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Big) }{ \exp\Big(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Big) + \exp\Big(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Big) } \end{equation} Dividing the terms in the denominator by the numerator, we see \begin{equation} p(y_{ij} = \mathbf{1} \mid y_{M(i,j)} ) = \frac{ 1 }{\frac{ \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) }{\mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg)} + \frac{ \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) } { \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) } } \end{equation} Simplifying further, \begin{equation} p(y_{ij} = \mathbf{1} \mid y_{M(i,j)} ) = \frac{ 1 }{ 1 + \frac{ \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) } { \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) } } \end{equation}
By the properties of exponents, division of exponentials is identical to subtracting exponents: \begin{equation} p(y_{ij} = \mathbf{1} \mid y_{M(i,j)} ) = \frac{ 1 }{ 1 + \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l} - \eta 1 x_{ij} - \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) } \end{equation} Simplifying terms, \begin{equation} p(y_{ij} = \mathbf{1} \mid y_{M(i,j)} ) = \frac{ 1 }{ 1 + \mbox{exp}\Bigg(-2\eta x_{ij} - 2\beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } y_{k,l} \Bigg) } \end{equation}
For the \(y_{ij} = \mathbf{-1}\) case:
\begin{equation} p(y_{ij} = \mathbf{-1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \propto \mbox{exp}\Bigg(\eta y_{ij} x_{ij} + \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } \beta y_{ij} y_{k,l}\Bigg) \end{equation} We can pull out \(\beta\): \begin{equation} p(y_{ij} = \mathbf{-1} \mid x_{ij}, y_{i-1,j}, y_{i+1,j}, y_{i,j-1},y_{i,j+1}) \propto \mbox{exp}\Bigg(\eta y_{ij} x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } y_{ij} y_{k,l}\Bigg) \end{equation} Since \(y\in \{-1,1\}\), by normalizing this probability, we immediately obtain a logistic function: \begin{equation} p(y_{ij} = \mathbf{-1} \mid y_{M(i,j)} ) = \frac{ \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) }{ \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) + \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) } \end{equation} Dividing the terms in the denominator by the numerator, we see \begin{equation} p(y_{ij} = \mathbf{-1} \mid y_{M(i,j)} ) = \frac{ 1 }{\frac{ \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) }{\mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg)} + \frac{ \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) } { \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) } } \end{equation} Simplifying further, \begin{equation} p(y_{ij} = \mathbf{-1} \mid y_{M(i,j)} ) = \frac{ 1 }{ \frac{ \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l}\Bigg) } { \mbox{exp}\Bigg(\eta (-1) x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) } + 1 } \end{equation}
By the properties of exponents, division of exponentials is identical to subtracting an exponent: \begin{equation} p(y_{ij} = \mathbf{-1} \mid y_{M(i,j)} ) = \frac{ 1 }{ \mbox{exp}\Bigg(\eta 1 x_{ij} + \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } 1 y_{k,l} - \eta (-1) x_{ij} - \beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } (-1) y_{k,l}\Bigg) + 1 } \end{equation} Simplifying terms, \begin{equation} p(y_{ij} = \mathbf{-1} \mid y_{M(i,j)} ) = \frac{ 1 }{ 1 + \mbox{exp}\Bigg(2\eta x_{ij} + 2\beta \cdot \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } y_{k,l} \Bigg) } \end{equation}
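These two conditionals are all the Gibbs sampler needs. Below is a small sketch of the resulting single-pixel update (here `eta` and `beta` are the coupling parameters from the factors above; this is an illustration, not the exact code used for the figures below):

```python
import numpy as np

def prob_plus_one(x_ij: int, neighbor_vals, eta: float, beta: float) -> float:
    """p(y_ij = +1 | Markov blanket), from the logistic form derived above."""
    s = sum(neighbor_vals)  # sum of grid-neighbor values, each in {-1, +1}
    return 1.0 / (1.0 + np.exp(-2.0 * eta * x_ij - 2.0 * beta * s))

def sample_pixel(x_ij: int, neighbor_vals, eta: float, beta: float, rng) -> int:
    """Draw y_ij from its conditional given its Markov blanket."""
    return 1 if rng.uniform() < prob_plus_one(x_ij, neighbor_vals, eta, beta) else -1
```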
If we have \(B\) "burn-in" iterations, followed by \(S\) sampling iterations afterwards, the procedure is as follows:
\begin{algorithm} \caption{Batch Update Systematic Scan Gibbs Sampling in order to compute the maximum a posteriori (MAP) query, \(\mbox{arg }\underset{y}{\mbox{max }} p(\mathbf{y} \mid \mathbf{x})\). Our transition kernel is \(Q(x^{\prime} \mid x)\). We draw samples \(a \sim p(Y_{ij}^{\prime} \mid Y_{-\{ij\}}^{(t)})\). Gibbs sampling assumes we can compute conditional distributions of one variable conditioned on all of the other variables and sample exactly from these distributions. In graphical models, the conditional distribution of some variable only depends on the variables in the Markov blanket of that node. Note that at timestep \(t\), we use \(Y_{i,j}^{(t)}\) (not \(Y_{i,j}^{(t+1)}\)) to update its right sequential neighbor, \(Y_{i,j+1}^{(t+1)}\). This corresponds to a single batch update to \(Y\) at the end of the iteration, vs. online updates that are used immediately. } \begin{algorithmic}[1] \Procedure {BatchUpdateSystematicScanGibbsSampling}{\(X\)} \State \(Y \leftarrow\) \texttt{zeros}\((N,M)\) \State \(c \leftarrow\) \texttt{zeros}\((N,M)\) \For {\(\text{row} \leftarrow 1, N\)} \For {\(\text{col} \leftarrow 1, M\)} \State \(Y_{ij}^{(t)} \leftarrow \texttt{init value}\) for \(t=0\) \State \(Y_{ij}^{(t+1)} \leftarrow 0\) for \(t=0\) \State \(c_{ij} \leftarrow 0\) \EndFor \EndFor \State \Comment Burn-in period… \For {\(t \leftarrow 0, B-1\)} \For {\(\text{row} \leftarrow 1, N\)} \For {\(\text{col} \leftarrow 1, M\)} \State \(Y_{ij}^{(t+1)} \leftarrow\) \texttt{SampleFromConditional}\((i,j,X,Y^{(t)})\) \EndFor \EndFor \EndFor \State \Comment Draw Samples … \For {\(t \leftarrow 0, S-1\)} \For {\(\text{row} \leftarrow 1, N\)} \For {\(\text{col} \leftarrow 1, M\)} \State \(Y_{ij}^{(t+1)} \leftarrow\) \texttt{SampleFromConditional}\((i,j,X,Y^{(t)})\) \If {\(Y_{ij}^{(t+1)} = 1\)} \State \(c_{ij}\) += 1 \EndIf \EndFor \EndFor \EndFor
\For {\(\text{row} \leftarrow 1, N\)} \For {\(\text{col} \leftarrow 1, M\)} \State \(p(y_{ij} = 1 \mid x) \leftarrow \frac{1}{S} \sum\limits_t \mathbb{1}\{y_{ij}^{(t)} = 1\} = \frac{1}{S} c_{ij}\) \EndFor \EndFor \EndProcedure
\Statex \Procedure {SampleFromConditional}{\(i\), \(j\), \(X\), \(Y\)} \State \(p(y_{ij} = 1 \mid y_{M(i,j)}) \leftarrow 1\Big/\Bigg(1+\mbox{exp}\Big(-2\eta x_{ij} - 2 \beta \sum\limits_{ (k,l)\in \mathcal{G}(i,j) } y_{k,l}\Big)\Bigg)\) \State \(p(y_{ij} = -1 \mid y_{M(i,j)}) \leftarrow 1 - p(y_{ij} = \mathbf{1} \mid y_{M(i,j)} )\) \State \(u \leftarrow\) \texttt{RandFloat}(0,1) \If {\(u \leq p(y_{ij} = 1 \mid y_{M(i,j)})\)} \State \(Y_{ij} \leftarrow 1\) \Else \State \(Y_{ij} \leftarrow -1\) \EndIf \State \textbf{return} \(Y_{ij}\)
\EndProcedure
\end{algorithmic} \end{algorithm}
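For reference, here is a compact NumPy sketch of the batch-update systematic scan described above. It reuses `sample_pixel` from the earlier snippet, skips out-of-bounds neighbors at the image border, and uses placeholder parameter values; it is a sketch, not the exact code behind the figures below.

```python
import numpy as np

def gibbs_denoise(X: np.ndarray, B: int = 100, S: int = 1000,
                  eta: float = 1.0, beta: float = 1.0, seed: int = 0) -> np.ndarray:
    """Batch-update systematic-scan Gibbs sampling for binary image denoising.
    X has entries in {-1, +1}; returns the estimated p(y_ij = 1 | x)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    Y = X.copy()               # initialize Y = X
    counts = np.zeros((N, M))  # counts of y_ij = +1 during the sampling phase

    for t in range(B + S):
        Y_next = np.empty_like(Y)
        for i in range(N):
            for j in range(M):
                # Four grid neighbors, read from the previous state Y (batch update).
                nbrs = [Y[k, l]
                        for (k, l) in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                        if 0 <= k < N and 0 <= l < M]
                Y_next[i, j] = sample_pixel(X[i, j], nbrs, eta, beta, rng)
        Y = Y_next
        if t >= B:             # only count samples after burn-in
            counts += (Y == 1)

    return counts / S
```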
In our case, the equilibrium distribution is in fact the posterior distribution \(p(\mathbf{y} \mid \mathbf{x})\), by the properties of MCMC. If the chain is irreducible and aperiodic, it is ergodic. The chain can always move to (or stay in) any state with nonzero probability, because an exponential function is always \(>0\), and thus \(\frac{1}{1 + \text{nonzero exponential}}\) always lies strictly between 0 and 1. We can reach all states from any state because we sample each pixel from \(\{-1,1\}\) with nonzero probability at every step.
Figure: 20% noise. Boltzmann (Gibbs) distribution energies as a function of iteration number, for three initializations of \(Y\): (a) \(Y=X\), (b) \(Y=-X\), and (c) a random initialization of \(Y\) from \(\{-1,1\}\). This visualization can be used to understand the mixing of the Markov chain. An energy of -50 is higher than an energy of -100; drawing on the physics metaphor, systems prefer to be in low-energy states, and low energy corresponds to regions of high probability. (Left) \(Y=X\) initialization: fastest convergence, because it starts closest to the truth. (Center) \(Y=-X\) initialization: slowest convergence, because it starts very far from the ground truth. (Right) Random initialization of \(Y\) from \(\{-1,1\}\): fast convergence, but not quite as fast as the \(Y=X\) initialization. (1) All three do seem to be converging to the same general region of the posterior, though the latter two initializations end up at clearly suboptimal energies. (2) The burn-in does seem to be adequate in length, because there is relatively minor fluctuation in energy values afterwards. (3) There is very little fluctuation from iteration to iteration in the sampling period, which to a naive observer might indicate that the chain is mixing well. However, the chains are stuck at the ground truth (within 2% error of it), so I'm fine with them remaining at a particular final energy for many iterations.
Figure: Denoising results. Top row: Gibbs sampling results for 20% noise, with \(Y=X\), \(Y=-X\), and random initializations of \(Y\) from \(\{-1,1\}\). Middle row: Gibbs sampling results for 10% noise, with the same three initializations. Bottom row: the original image, the 20% noise image, and the 10% noise image. All runs use a burn-in of 100 iterations and 1000 sample iterations.
Table: Accuracy of the denoised reconstruction. All Gibbs sampling initializations achieve roughly 2-2.5% error.

| Noise Level | Burn-in, Sample Iterations | Same | Neg | Rand |
|---|---|---|---|---|
| 10% | 100, 1000 | 0.980141 | 0.980419 | 0.980106 |
| 20% | 100, 1000 | 0.975730 | 0.975672 | 0.975626 |
Neighborhood Consensus Voting
def sequential_consensus_voting(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Denoise a binary image by repeated majority voting.

    Each pixel Y[i,j] is set to the most frequent value among its grid
    neighbors in Y (its Markov blanket) plus the observed pixel X[i,j].
    Updates are sequential (in place), so later pixels see earlier votes.
    """
    num_rows, num_cols = X.shape[:2]
    # Precomputed flattened indices of each pixel's Markov blanket (author-provided helper).
    mb_flat_idx_dict = make_mb_flat_idx_dict(num_rows, num_cols)
    for consensus_iter in range(30):
        # im = (Y * 0.5) + 0.5  # make background black
        im = 0.5 * (1.0 - Y)  # make background white
        im = im.astype(np.uint8)
        convert_to_png(im, f"consensus_voting_iter_{consensus_iter}")  # author-provided helper
        print(f"Consensus iter {consensus_iter}")
        for i in range(num_rows):
            for j in range(num_cols):
                # Gather the Markov-blanket values from Y, plus the observed pixel.
                mb_y_flat_indices = mb_flat_idx_dict[(i, j)]
                mb_y_vals = Y.flatten()[mb_y_flat_indices]
                mb_y_vals = np.hstack((mb_y_vals, np.array([X[i, j]])))
                # Take the majority vote and set the middle pixel
                # to the most frequent element.
                (label_values, counts) = np.unique(mb_y_vals, return_counts=True)
                ind = np.argmax(counts)
                Y[i, j] = label_values[ind]
    return Y
Figure: Denoised images by neighborhood consensus voting, starting from the 20% noise image and from the 10% noise image. Although these reconstructions are good, they are not as good as the Gibbs sampling reconstructions. Neighborhood consensus voting loses some fine-grained detail and tends to smear the challenging regions together (it struggles to differentiate the white space inside of thick black letters, for example). The 20% noise image achieves 93.91% accuracy via consensus voting, and the 10% noise image achieves 96.7338% accuracy. This does not match the accuracy of Gibbs sampling denoising, which was ~98% for both images.
The square patch has dimensions \(37 \times 31\), so it contains 1147 pixels.
- When I run my Gibbs sampler only on the small patch of the 10% noise image, I always see 437 pixels that are black (have value "-1"). This value does not change. The posterior distribution over frequencies follows a normal distribution.
- When I run my Gibbs sampler only on the small patch of the 20% noise image, I always see 428 pixels that are black (have value "-1"). This value does not change.
Figure: \(y_{ij}=1\) vs. \(y_{ij}=-1\) frequencies when sampling only the square patch that contains the letter "Z," for the 10% noise and 20% noise images.
Metropolis Hastings for Data Association
MCMC is also useful for data association under noisy measurements. Suppose we have \(K\) objects and \(K\) observations.
This setting follows the formulation of Oh, Russell, and Sastry [5].
We wish to estimate a distribution \(P\) given by
\begin{equation} P(C \mid v_1, \dots, v_k) = \frac{ P(v_1, \dots, v_k \mid C) P(C) }{ P(v_1, \dots, v_k) } \end{equation}
with Metropolis-Hastings MCMC. Metropolis-Hastings dictates an acceptance probability of
\begin{equation} A(x^{\ast} \mid x) = \mbox{min}\Bigg(1, \frac{P(x^{\ast} ) Q(x \mid x^{\ast} )}{P(x)Q(x^{\ast} \mid x)}\Bigg) \end{equation}
Our proposal distribution \(Q\) simply samples uniformly at random, so \(Q(C_i,C_j) \propto \frac{1}{|\text{Val}(C)|} \cdot \frac{1}{|\text{Val}(C)|}\). If \(Q\) is uniform, then
\begin{equation} \frac{Q(x \mid x^{\ast})}{Q(x^{\ast} \mid x)} = 1 \end{equation}
With a uniform \(Q\), then a general MH acceptance probability becomes
\begin{equation} A(x^{\ast} \mid x) = \mbox{min}\Bigg(1,\frac{P(x^{\ast})}{P(x)}\cdot 1 \Bigg) \end{equation}
Consider our specific problem: plugging in the definition of our distribution of interest, where \(C^{\prime}\) is our new sample and \(C\) is our previous sample, we see \begin{equation} A(C^{\prime} \mid C) = \mbox{min}\Bigg(1,\frac{P(C^{\prime} \mid v_1, \dots, v_k)}{P(C \mid v_1, \dots, v_k)}\cdot 1 \Bigg) \end{equation} By Bayes' Rule, \begin{equation} A(C^{\prime} \mid C) = \mbox{min}\Bigg(1,\frac{ \frac{ P(v_1, \dots, v_k \mid C^{\prime}) P(C^{\prime}) }{ P(v_1, \dots, v_k)} }{ \frac{ P(v_1, \dots, v_k \mid C) P(C) }{ P(v_1, \dots, v_k) } }\Bigg) \end{equation} The expression \(P(v_1, \dots, v_k)\) appears in both the numerator and denominator, so it cancels:
\begin{equation} A(C^{\prime} \mid C) = \mbox{min}\Bigg(1,\frac{ P(v_1, \dots, v_k \mid C^{\prime}) P(C^{\prime}) }{ P(v_1, \dots, v_k \mid C) P(C) }\Bigg) \end{equation} Since each correspondence vector is equally likely \textit{a priori}, we know \(P(C)\) is uniform: the correspondence vectors are permutations of \(1,\dots,K\), and all permutations are equally likely \textit{a priori}. Thus,
\begin{equation} \frac{ P(C^{\prime}) }{ P(C) } = 1 \end{equation} We now have \begin{equation} A(C^{\prime} \mid C) = \mbox{min}\Bigg(1,\frac{ P(v_1, \dots, v_k \mid C^{\prime}) }{ P(v_1, \dots, v_k \mid C ) }\Bigg) \end{equation}
We can see that, given all of the correspondence variables (the evidence) \(C_1, \dots, C_i, \dots, C_K\), the observations are independent of one another. If we were not given \(C_1, \dots, C_i, \dots, C_K\), they would be \textbf{dependent}: we cannot use any one integer from \([1,K]\) more than once.
Thus, \(v_i \perp v_j \mid C_i,C_j\) for all \(i,j \in \{1,\dots,K\}\).
More precisely, \begin{equation} \begin{array}{ll} v_i \perp v_j \mid C_1, \dots,C_K, & \mbox{for all } i, j \end{array} \end{equation} so we know \begin{equation} P(v_1, \dots, v_K \mid C_1, \dots, C_i, \dots, C_K ) = \prod\limits_{i=1}^K P(v_i \mid C_i) \end{equation}
\begin{equation} A(C^{\prime} \mid C) = \mbox{min}\Bigg(1,\frac{ \prod\limits_{i=1}^K P(v_i \mid C_i^{\prime}) }{ \prod\limits_{i=1}^K P(v_i \mid C_i) }\Bigg) \end{equation}
Given \(M\) samples \((C_1[m],\dots,C_K[m])\) for \(m = 1, \dots , M\), collected after the chain has mixed, we can obtain an estimate of the marginal by \begin{equation} P(C_i = c \mid v_1, \dots, v_K) \approx \frac{1}{M}\sum\limits_{m=1}^M \mathbb{1}\{C_i[m]=c\} \end{equation}
That is, we are estimating how frequently each observation is associated with each object across the collected samples.
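As a hedged sketch of how such a sampler might look in code (the log-likelihood function `log_lik` and the symmetric swap proposal are our own illustrative choices, so the \(Q\) ratio is 1 as above):

```python
import numpy as np

def mh_data_association(log_lik, K, num_samples, burn_in=1000, seed=0):
    """Sample correspondence vectors C (a permutation of 0..K-1) with MH.
    log_lik(i, k) = log P(v_i | C_i = k).  The proposal swaps the assignments
    of two observations, which is symmetric, so Q cancels in the ratio."""
    rng = np.random.default_rng(seed)
    C = rng.permutation(K)            # initial correspondence vector
    marginals = np.zeros((K, K))      # marginals[i, k] estimates P(C_i = k | v)
    for t in range(burn_in + num_samples):
        i, j = rng.choice(K, size=2, replace=False)
        # Only the two swapped factors change in the product of likelihoods.
        log_ratio = (log_lik(i, C[j]) + log_lik(j, C[i])
                     - log_lik(i, C[i]) - log_lik(j, C[j]))
        if np.log(rng.uniform()) < log_ratio:   # accept with prob. min(1, ratio)
            C[i], C[j] = C[j], C[i]
        if t >= burn_in:
            marginals[np.arange(K), C] += 1     # empirical marginal estimate
    return marginals / num_samples
```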
Gibbs sampling uses an acceptance probability of 1 for every proposal, so no proposed sample is ever rejected. The trade-off is that successive samples can be highly correlated when variables are strongly coupled, so the chain may still mix slowly.
In general, Gibbs sampling is useful for undirected models because, when the proposal distribution is the conditional distribution \(p(x^{\prime} \mid x)\), we need only condition on the Markov blanket (the neighbors in an undirected graph). Here that is useful because each pixel's conditional depends only on its four grid neighbors and the corresponding observed pixel, so each update is cheap to compute.
References

1. Monte Carlo Localization for Mobile Robots. [PDF]
2. Gibbs sampling. Wikipedia.
3. David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. [PDF]
4. Wainwright and Jordan. Graphical Models, Exponential Families, and Variational Inference. [PDF]
5. Oh, Russell, Sastry. Markov Chain Monte Carlo Data Association for Multiple-Target Tracking. [PDF]
6. Stuart Geman and Donald Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. TPAMI, 1984. [PDF]