The Need for Dimensionality Reduction

When the dimensionality of data points is high, interpretation or visualization of the data can be quite challenging.

Linear Methods

Principal Components Analysis (PCA)

PCA is an algorithm for linear dimensionality reduction that can be surprisingly confusing. There are many different ways to explain the algorithm, and it is not new – Karl Pearson (University College, London) introduced the ideas in 1901 and Harold Hotelling (Columbia) developed the terminology and more of the math about principal components in 1933.

To understand PCA, we must first consider data as points in a Euclidean space. Suppose the data lies in a high-dimensional space, but its true structure is low-dimensional. We would like to identify that real, lower-dimensional subspace; in PCA, lower-dimensional structure corresponds to linear subspaces of the high-dimensional space.

PCA has one key hyperparameter: the dimensionality of the lower-dim space. How many dimensions do we need to explain most of the variability in the data?

The key idea is that we will project the points from high-D to low-D, and then when we project them back into high-D, the “reconstruction error” should be minimized.

Pearson (1901): find a single lower-dimensional subspace that captures most of the variation in the data. What should be the dimensionality of the lower-dim space?

The lower-dimensional subspace acts as a proxy for the real space of the data.

Choose it to minimize the errors introduced by projecting the data into this subspace.

Assumption: the data has zero mean (``centered data'').

Move the origin to the mean of the data; the data then lives in a hyperplane through the origin, which becomes a linear subspace in the new, translated coordinate system.

As a consequence, variance becomes just the 2nd moment

Consider data points in 2D: project them onto the best-fit line (a projection onto 1D), then bring each projected point back into 2D by multiplying by the direction vector (this is the reconstruction), and assess how well the reconstructed points match the original data.

\[x_{||} = vv^Tx\]

Minimize an error function: minimizing this reconstruction error is equivalent to maximizing the variance of the projected data.

\[E = \frac{1}{M} \sum\limits_{\mu=1}^{M} \sum\limits_{i=1}^{I} \left( x_i^{\prime \mu} - x_{||,i}^{\prime \mu} \right)^2 = \left< \|x^{\prime \mu} - x_{||}^{\prime \mu}\|^2 \right>_{\mu}\]

Here \(I\) is the dimensionality of the data and \(M\) is the number of data points. Why is minimizing this error the same as finding the directions of maximum variance? Because the projection is orthogonal, the total variance splits into the variance captured in the subspace plus the reconstruction error, so minimizing one term maximizes the other. Knowing the covariance matrix tells us about the 2nd-order moments, but not about the higher-order structure of the data (unless the data is Gaussian).
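To make this concrete, here is a minimal MATLAB sketch (the toy 2-D data and the candidate direction \(v\) are my own, not from the lecture): it centers the points, projects them onto a unit direction, reconstructs them, and evaluates the average squared reconstruction error \(E\).

\begin{lstlisting}[language=MATLAB,caption=Projecting onto a candidate direction]
% Minimal sketch: project centered 2-D points onto a unit direction v
% and measure the average squared reconstruction error E.
rng(0);
M = 200;
X = [1 0.8; 0.8 1] * randn(2, M);          % 2 x M data, one point per column
X = X - mean(X, 2) * ones(1, M);           % center the data (zero-mean assumption)

v = [1; 2] / norm([1; 2]);                 % candidate unit direction (hypothetical)
X_par = v * (v' * X);                      % reconstruction: x_par = v v' x
E = mean(sum((X - X_par).^2, 1));          % average squared reconstruction error

fprintf('Reconstruction error for this v: %f\n', E);
\end{lstlisting}

Sweeping \(v\) over all unit directions and keeping the one with the smallest \(E\) is exactly what the eigenvector analysis below does in closed form.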

If the covariance matrix is diagonal, then finding the direction of maximum variance is easy

Rotate your coordinate system so that the covariance matrix becomes diagonal.

This becomes an eigenvalue problem: the eigenvectors of the covariance matrix point in the directions of maximal variance.

Solve for all top-\(k\) subspaces at once: the eigendecomposition is valid for all of them, since for each \(k\) the top-\(k\) eigenvectors span the optimal \(k\)-dimensional subspace.

Check the error as a function of how many dimensions we use; there is a tradeoff. Since \(\mathbf{V}\) has orthonormal columns, \(\mathbf{V}^T \mathbf{V} = I\). With low-dimensional coordinates \(y = \mathbf{V}^Tx\), the projection is \(x_{||} = \mathbf{V}y = \mathbf{V}\mathbf{V}^Tx\).

Projection operator = \(\mathbf{V}\mathbf{V}^T\)

But in general the projection operator \(P = \mathbf{V}\mathbf{V}^T\) is of low rank and loses information.

Variance and reconstruction error: the reconstruction error is the total variance minus the variance retained in the subspace,

\[E = \langle x^Tx \rangle - \langle y^T y \rangle = \langle x^Tx \rangle - \langle x_{||}^T x_{||} \rangle\]

Since the total variance \(\langle x^Tx \rangle\) is fixed by the data, minimizing the reconstruction error is equivalent to maximizing the variance \(\langle y^T y \rangle\) in the projected subspace.

Spectral analysis of the covariance matrix: because the covariance matrix is symmetric, it always has real eigenvalues and orthogonal eigenvectors.

The total variance of the data is just the sum of the eigenvalues of the covariance matrix, which are all non-negative.

In this basis we have diagonalized the covariance matrix, i.e., aligned the coordinate axes with the data, which is what we set out to do.

Choosing the subspace is an extremal trace problem: the variance captured by the subspace is

\[\sum\limits_i \lambda_i \sum\limits_p (v_{ip}^{\prime})^2 = \mbox{tr}(V^{\prime T} \Lambda V^{\prime})\]

To maximize the variance, we need to put as much weight as possible on the large eigenvalues.

The reconstruction error is

\[\sum\limits_{i=P+1}^I \lambda_i\]

PCA aligns the axes with your data, so that for any \(P\) you can read off the optimal \(P\)-dimensional subspace.

Eigenvectors Solve PCA

Main Lesson from PCA: For any P, the optimal \(P\)-dimensional subspace is the one spanned by the first \(P\) eigenvectors of the covariance matrix

The eigenvectors give us the frame that we need – align with it
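A short MATLAB sketch of this main lesson on toy data (my own example, not from the notes): diagonalize the covariance matrix, keep the top \(P\) eigenvectors, and check numerically that the reconstruction error equals the sum of the remaining eigenvalues.

\begin{lstlisting}[language=MATLAB,caption=PCA via the covariance eigendecomposition]
% PCA on toy data: the top-P eigenvectors of the covariance matrix span the
% optimal P-dimensional subspace, and the error is the sum of the rest.
rng(0);
I = 5; M = 1000; P = 2;
X = randn(I, I) * randn(I, M);             % I x M data with correlated dimensions
X = X - mean(X, 2) * ones(1, M);           % center

C = X * X' / M;                            % covariance of the centered data
[V, L] = eig(C);
[lambda, order] = sort(diag(L), 'descend');
V = V(:, order);                           % eigenvectors sorted by eigenvalue

Vp = V(:, 1:P);                            % basis of the optimal P-dim subspace
X_par = Vp * (Vp' * X);                    % project and reconstruct
err = mean(sum((X - X_par).^2, 1));

fprintf('Reconstruction error:        %f\n', err);
fprintf('Sum of trailing eigenvalues: %f\n', sum(lambda(P+1:end)));
\end{lstlisting}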

PCA Examples

The first principal component does not always have semantic meaning; in one example, a combination of arts, recreation, transportation, health, and housing explained the ratings of places the most.

This is simply what the analysis gives.

The math is beautiful, but semantic meaning is not always clear. Sparse PCA tries to fix this (linear combinations of only a small number of variables)

Geography of where an individual came from is reflected in their genetic material

The PCA machinery also gives us a way to understand distortions or changes of shape: each deformation is a recipe encoded as a matrix.

PCA works very well here

View matrices as points in high-dimensional space

Recover grid structure for deformed sphere

Get circular pattern of galloping horse from points

SVD Trick

What if we have fewer data points than dimensions \(M<I\)? Think of images…

By SVD, transpose your data

Think of your data as rows, not columns

Each data point of the transposed problem is now, e.g., the first pixel's value across 1000 images.

Covariance matrix \(C_2\) of transposed problem

Rows of transformed X (PCA’d X) are just eigenvectors of \(C_1\)

Corresponding eigenvalues are just those of \(C_2\), scaled by \(I/M\)

Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables. PCA looks for a single lower dimensional subspace that captures most of the variation in the data. Specifically, we aim to minimize the error introduced by projecting the data into this linear subspace.

Use spectral analysis of the covariance matrix \(C\) of the data. For any integer \(p\), the error-minimizing \(p\)-dimensional subspace is the one spanned by the first \(p\) eigenvectors of the covariance matrix.

Eigenfaces (PCA on face images) Turk and Pentland, 1991

https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch18.pdf

http://theory.stanford.edu/~tim/s17/l/l8.pdf

SVD Transpose Trick

We multiply the eigenvectors of (data . data.T) / lines on the left by data.T, in order to get the eigenvectors of (data.T . data) / lines. This is sometimes called the “transpose trick”.

Suppose you have a matrix \(A\) that you want to perform PCA on; for simplicity, suppose that the columns of \(A\) have already been normalized to have zero mean, so that we just need to compute the eigenvectors of the covariance matrix \(A^TA\).

Now if \(A\) is an \(m \times n\) matrix, with \(n \gg m\), then \(A^TA\) is a very large \(n \times n\) matrix. So instead of computing the eigenvectors of \(A^TA\), we might like to compute the eigenvectors of the much smaller \(m \times m\) matrix \(AA^T\) – assuming we can figure out a relationship between the two. So how are the eigenvectors of \(A^TA\) related to the eigenvectors of \(AA^T\)?

Let \(v\) be an eigenvector of \(AA^T\) with eigenvalue \(\lambda\). Then

\[\begin{aligned} AA^Tv &= \lambda v \\ A^T(AA^Tv) &= A^T(\lambda v) \\ (A^TA)(A^Tv) &=\lambda(A^Tv) \end{aligned}\]

In other words, if \(v\) is an eigenvector of \(AA^T\), then \(A^Tv\) is an eigenvector of \(A^TA\), with the same eigenvalue. So when performing a PCA on \(A\), instead of directly finding the eigenvectors of \(A^TA\) (which may be very expensive), it’s easier to find the eigenvectors \(v\) of \(AA^T\) and then multiply these on the left by \(A^T\) to get the eigenvectors \(A^Tv\) of \(A^TA\).
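This relationship is easy to check numerically. A small MATLAB sketch on a random toy matrix (my own example; a real PCA would also center \(A\) and normalize the resulting eigenvectors):

\begin{lstlisting}[language=MATLAB,caption=Checking the transpose trick]
% Verify: if v is an eigenvector of A*A', then A'*v is an eigenvector of A'*A
% with the same eigenvalue.
rng(0);
m = 5; n = 50;                             % few rows, many columns (n >> m)
A = randn(m, n);

[V_small, D_small] = eig(A * A');          % small m x m eigenproblem
lambda = diag(D_small);
U = A' * V_small;                          % columns are A'*v

residual = norm((A' * A) * U - U * diag(lambda), 'fro');
fprintf('Residual of the eigenvector relation: %g\n', residual);
\end{lstlisting}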

See StackOverflow post on this topic.

Eigenfaces implementation.

\begin{enumerate}[label=(\alph*)]
\item 
\begin{figure}[!h]
\includegraphics[width=0.4\linewidth]{cs233_sad_face.png}
\includegraphics[width=0.4\linewidth]{cs233_happy_face.png}
\includegraphics[width=0.4\linewidth]{cs233_mean_face.png}
\caption{Mean faces from the training set. Top Left: Sad. Top Right: Happy. Below Left: General training set mean.}
\end{figure}

\begin{lstlisting}[language=MATLAB,caption=Mean Faces]
% ------------------------
% CS233 HW1
% Problem 1  Starter Code
% ------------------------
clear all; close all; clc;
colormap gray
addpath('data_p1');
rng(0); 

% Load the face data set
extensions = {'centerlight', 'glasses', 'happy', 'leftlight', ...
    'noglasses', 'normal', 'rightlight', 'sad', 'sleepy', 'surprised', 'wink' };
load('YaleFaces');

imH = 159; imW = 159;

% -----------------------------------------
% TODO: (a) visualize mean faces from train set
% use imagesc for plotting

mean_face = zeros(imH,imW);
happy_mean_face = mean_face;
sad_mean_face = mean_face;

num_train = 0;
num_happy = 0;
num_sad = 0;

% Extract the faces into training and test faces
trainidx = 1;
testidx = 1;
for i = 1:length(Faces),
    for j = 1:length(extensions);
        
%         if strcmp(extensions{j}, 'leftlight' ) || ...
%             strcmp(extensions{j},'rightlight' ) || ...
%             strcmp(extensions{j}, 'glasses' )
%             continue
%         end
        
        X = double(Faces(i).(extensions{j}));
        if( rand > 0.3 )
            train(trainidx).data = X;
            train(trainidx).label = i;
            trainidx = trainidx + 1;
            % TODO: add code here to compute mean face
            % filter extensions for happy and sad faces
            
            if strcmp(extensions{j}, 'happy')
                happy_mean_face = happy_mean_face + X;
                num_happy = num_happy + 1;
            end
            if strcmp(extensions{j}, 'sad')
                sad_mean_face = sad_mean_face +  X;
                num_sad = num_sad + 1;
            end
            mean_face = mean_face + X;
            num_train = num_train + 1;
        else
            test(testidx).data = X;
            test(testidx).label = i;
            testidx = testidx + 1;
        end;
    end;
end;

mean_face = mean_face * (1.0 / num_train );
happy_mean_face = happy_mean_face * (1.0 / num_happy );
sad_mean_face = sad_mean_face* (1.0 / num_sad );
\end{lstlisting}

\item

\newpage
\item 

\begin{figure}
\includegraphics[width=0.7\linewidth]{cs233_eigenfaces_u1.png}
\caption{Top 25 eigenfaces}
\end{figure}
\newpage
\item
We have fewer data points than dimensions: only 110 images in the training set, each a vector in $\mathbbm{R}^{25281}$.

Our data matrix $X$ is composed of vectorized $I$-dimensional images, where $X \in \mathbbm{R}^{I \times M}$ because we have $M$ examples. $M <I$ in our example:
\begin{equation}
\mathbf{X} = (\mathbf{x}^1, \dots, \mathbf{x}^M) \in \mathbbm{R}^{I \times M} =\mathbbm{R}^{25281 \times 110}
\end{equation}
The second moment matrix is 
\begin{equation}
\begin{array}{ll}
C_1 = \mathbf{X}  \mathbf{X}^T / M, & C_1 \in \mathbbm{R}^{25281 \times 25281}
\end{array}
\end{equation}
Finding an eigendecomposition or computing the SVD of a matrix of this size would be prohibitively slow, but if we were to do it, we would find
\begin{equation}
C_1 U_1 = U_1 \Lambda_1
\end{equation}
\begin{equation}
C_1  = U_1 \Lambda_1 U_1^T
\end{equation}
The data represented in the coordinate system of the eigenvectors is
\begin{equation}
\begin{array}{ll}
Y_1 = U_1^T \mathbf{X}, & Y_1 \in \mathbbm{R}^{ 25281 \times 110 }
\end{array}
\end{equation}
which is still high-dimensional.

Another approach is to interpret the data matrix $\mathbf{X}$ transposed, i.e. swapping the data point index for the dimension index. In our example, this is equivalent to having 25,281 data points in 110 dimensional space, which is much easier to deal with. 
\begin{equation}
\begin{array}{ll}
C_2 =  \mathbf{X}^T \mathbf{X}   / I, & C_2 \in \mathbbm{R}^{110 \times 110}
\end{array}
\end{equation}
\begin{equation}
C_2 U_2 = U_2 \Lambda_2
\end{equation}
\begin{equation}
\begin{array}{ll}
C_2 = U_2 \Lambda_2  U_2^T, & U_2 \in \mathbbm{R}^{110 \times 110}
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
Y_2 = U_2^T \mathbf{X}^T, & Y_2 \in \mathbbm{R}^{110 \times 25281}
\end{array}
\end{equation}
\begin{equation}
\begin{array}{ll}
U_1 = Y_2^T, & U_1 \in \mathbbm{R}^{ 25281 \times 110 }
\end{array}
\end{equation}
We can easily show that the rows of the matrix $Y_2$ are the eigenvectors of matrix $C_1$:
\begin{equation}
\begin{aligned}
C_1 Y_2^T &= (\mathbf{X} \mathbf{X}^T / M)(\mathbf{X} U_2) \\
&= \mathbf{X} (\mathbf{X}^T\mathbf{X} / I)\, U_2 \, (I/M) \\
&= \mathbf{X} C_2 U_2 \, (I/M) \\
&= \mathbf{X} U_2 \Lambda_2 \, (I/M) \\
&= Y_2^T \Lambda_2 \, (I/M)
\end{aligned}
\end{equation}



The corresponding eigenvalues are eigenvalues of $C_2$ scaled by $I/M$.

\newpage
\item 

\begin{figure}[!h]
\includegraphics[width=0.5\linewidth]{cs233_face_to_reconstruct.png}
\caption{Original Image.}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=0.8\linewidth]{cs233_low_rank_reconstructions}
\caption{Visualization of the reconstructed image using 1, 10, 20, 30, 40, 50 components.}
\end{figure}

\begin{figure}[!h]
\includegraphics[width=0.8\linewidth]{cs233_reconstruction_with_only_clean_images.png}
\caption{When we use clean images only, we discover that we can create a reasonable reconstruction with 10 fewer components.}
\end{figure}


\item 

     4 dimensions are needed to exceed 50\% of the energy. \\


    50 dimensions capture 94.58\% of the energy.


\begin{lstlisting}[language=MATLAB,caption=Eigenfaces]
% ------------------------
% CS233 HW1
% Problem 1  Starter Code
% ------------------------
clear all; close all; clc;
colormap gray
addpath('data_p1');
rng(0); 

% Load the face data set
extensions = {'centerlight', 'glasses', 'happy', 'leftlight', ...
    'noglasses', 'normal', 'rightlight', 'sad', 'sleepy', 'surprised', 'wink' };
load('YaleFaces');

imH = 159; imW = 159;

% -----------------------------------------
% TODO: (a) visualize mean faces from train set
% use imagesc for plotting

mean_face = zeros(imH,imW);
happy_mean_face = mean_face;
sad_mean_face = mean_face;

% Extract the faces into training and test faces
trainidx = 1;
testidx = 1;
for i = 1:length(Faces),
    for j = 1:length(extensions);
        
%         if strcmp(extensions{j}, 'leftlight' ) || strcmp(extensions{j},
%         'rightlight' ) || strcmp(extensions{j}, 'glasses' )
%             continue
%         end
        
        X = double(Faces(i).(extensions{j}));
        if( rand > 0.3 )
            train(trainidx).data = X;
            train(trainidx).label = i;
            trainidx = trainidx + 1;
            % TODO: add code here to compute mean face
            % filter extensions for happy and sad faces
            %
        else
            test(testidx).data = X;
            test(testidx).label = i;
            testidx = testidx + 1;
        end;
    end;
end;

for trainidx=1:size(train,2)
    mean_face = mean_face + double(train(trainidx).data);
end
mean_face = mean_face * (1.0 / size(train,2) );

for i=1:size(Faces,2)
    happy_mean_face = happy_mean_face + double(Faces(i).happy);
    sad_mean_face = sad_mean_face +  double(Faces(i).sad);
end
happy_mean_face = happy_mean_face * (1.0 / size(Faces,2) );
sad_mean_face = sad_mean_face* (1.0 / size(Faces,2) );


% Zero-center and put all train set images into a big matrix
idx = 1;
for i = 1:length(train),
    X = double(train(i).data);
    W = X - mean_face;
    M(:, idx) = W(:);
    idx = idx + 1;
end;


% -----------------------------------------
% TODO: (b) do PCA (by MATLAB function svd) on face
% images and plot the energy (sum of top-k variances/total sum) curve
%
C2 = M' * M / ( 1.0 * size(M,1) );
[U2,S,U2_dup] = svd(C2,'econ');
% U2 has dim 110 x 110

diag(S)

U1 = M * U2;
% U1 has dim (25281,110)

% -----------------------------------------
% TODO: (c) show top 25 eigen faces
% Hint: you can create a large matrix (image) and set each
% eigen face image to a portion of it.
big_image = zeros(imH*5,imW*5);
eigen_idx = 1;
for i = 1:5
    for j = 1:5
        % TODO: fill in eigen face below
        idx = ( (i-1) *5) + j;
        big_image((i-1)*imH+1:i*imH, (j-1)*imW+1:j*imW) = reshape( U1(:,idx), imH, imW);
    end
end
figure, imagesc(big_image), colormap('gray'); axis off;


% -----------------------------------------
% TODO: (d) face reconstruction
% reconstruct the first face train(1).data with different
% number of principal components
%

% Plot the original face
orig_face1 = reshape(train(1).data, size(mean_face)); % + mean_face;
figure, imagesc(orig_face1); 
colormap(gray); axis off; %truesize; 

% Plot reconstructed faces, 
figure;
idx = 1;
for k = [1,10,20,30,40,50]
    
    % TODO: finish the reconstruction of using k eigen faces
    recon = M * U2(:,1:k) * U2(:, 1:k)';
    recon = reshape( recon(:,1), imH, imW);
    recon = recon + mean_face;
    subplot(2,3,idx), imagesc(recon);
    colormap(gray);
    axis off;
    
    idx = idx + 1;
end;


% -----------------------------------------
% TODO: (e) face recognition
% For each test example, find best match in training data
%

% Put all of the testing faces into one big matrix
idx = 1;
for i = 1:length(test),
    X = double(test(i).data);
    Wh = X - mean_face; % zero-center it
    T(:, idx) = Wh(:);
    idx = idx + 1;
end;

k = 25;
% Normalize the top-k eigenfaces so that their columns have unit length
Uk = U1(:,1:k);
Uk = Uk ./ repmat(sqrt(sum(Uk.^2, 1)), size(Uk,1), 1);
for i = 1:length(train)
    % Project each zero-centered training image onto the top 25 eigenfaces
    trainweights(:,i) = Uk' * M(:,i);
end;

figure;
figure_idx = 1;
row = 7;
col = ceil(length(test)/row);
accuracy_sum = 0;

for i = 1:length(test)

    % Project the i-th zero-centered test image onto the top 25 eigenfaces
    % (one possible completion of the starter TODO)
    testweights = Uk' * T(:,i);
    
    % Nearest-neighbor search:
    % find the training image whose weights are closest to this test image
    dists = sum((trainweights - repmat(testweights, 1, size(trainweights,2))).^2, 1);
    [~, nearest_train_idx] = min(dists);
    nearest_train_image = train(nearest_train_idx).data;
    
    % See if recognition is correct
    correct = (test(i).label == train(nearest_train_idx).label);
    accuracy_sum = accuracy_sum + correct;
    
    subplot(row,col,figure_idx), imagesc([test(i).data, nearest_train_image]);
    axis off; colormap(gray); drawnow;
    if correct
        title('\color{green}CORRECT');
    else
        title('\color{red}WRONG');
    end
    figure_idx = figure_idx + 1;
    
end;

accuracy = accuracy_sum / length(test);
fprintf('Face recognition accuracy: %f\n', accuracy);

% -----------------------------------------
% TODO: (f) recognition for non-face image
%

%Now load three images and project them to both the eigenfaces,
% and to the orthogonal complement
k = 50;
test_imgs = {'face1.jpg', 'face2.jpg', 'nonface1.jpg'};
for j=1:length(test_imgs),
    im = double(imread(test_imgs{j}));
    
    % Use the top-k eigenfaces to reconstruct the image
    % (one possible completion; assumes im is a grayscale imH x imW image)
    Uk50 = U1(:,1:k);
    Uk50 = Uk50 ./ repmat(sqrt(sum(Uk50.^2, 1)), size(Uk50,1), 1);
    w = im(:) - mean_face(:);               % zero-center the image
    recon_im = Uk50 * (Uk50' * w);          % project onto the eigenface subspace
    recon_im = reshape(recon_im, imH, imW) + mean_face;
    
    figure, imagesc([im recon_im]);
    colormap(gray); axis off;
    
    % Relative norm difference between the original and reconstructed image
    norm_diff = norm(im - recon_im, 'fro') / norm(im, 'fro')
        
end;

\end{lstlisting}

\end{enumerate}

Multi-Dimensional Scaling (MDS)

Non-Linear Methods

Many data sets contain essential nonlinear structures and unfortunately these structures are invisible to linear methods like PCA [1].

For example, Euclidean distance between points cannot disentangle or capture the structure of manifolds like the “swiss roll.” Instead, we need geodesic distance: distance directly on the manifold.

Furthermore, PCA cares about large distances. The goal is to maximize variance, and variance comes from things that are far apart. If you care about small distances, PCA is the wrong tool to use.

Creating Graphs from Geometric Point Data

These constructions rely upon neighborhood relations: a graph can be built via (1) k-nearest neighbors or (2) \(\epsilon\)-balls.
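A small MATLAB sketch of both constructions on toy points (the values of \(k\) and \(\epsilon\) are illustrative):

\begin{lstlisting}[language=MATLAB,caption=kNN and epsilon-ball graphs]
% Build neighborhood graphs from N points in d dimensions (toy data).
rng(0);
N = 200; d = 3;
X = randn(N, d);                           % one point per row

% Pairwise Euclidean distances
sq = sum(X.^2, 2);
D = sqrt(max(0, repmat(sq, 1, N) + repmat(sq', N, 1) - 2 * (X * X')));

% (1) k-nearest-neighbor graph
k = 6;
[~, idx] = sort(D, 2);                     % sort each row by distance
A_knn = false(N);
for i = 1:N
    A_knn(i, idx(i, 2:k+1)) = true;        % skip idx(i,1): the point itself
end
A_knn = A_knn | A_knn';                    % symmetrize

% (2) epsilon-ball graph
epsilon = 1.0;                             % illustrative radius
A_eps = (D < epsilon) & ~eye(N);
\end{lstlisting}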

Isometric Feature Mapping (ISOMAP)

In the case of the swiss roll, Isometric Feature Mapping (ISOMAP) produces an unrolled, planar version of data that respects distances [5]. This is “isometric” unrolling.

The ISOMAP algorithm involves three main steps, as outlined by Guibas [1]:

  • (1.) Form a nearest-neighbor graph \(\mathcal{G}\) on the original data points, weighting the edges by the original distances \(d_x(i,j)\). One can build the NN-graph either (A) with a fixed radius/threshold \(\epsilon\), or (B) with a fixed number of neighbors.
  • (2.) Estimate the geodesic distances \(d_{\mathcal{G}}(i,j)\) between all pairs of points on the sampled manifold by computing their shortest-path distances in the graph \(\mathcal{G}\). This can be done with classic all-pairs shortest-path (APSP) algorithms: the Floyd-Warshall algorithm or repeated Dijkstra. Initially, all pairs are given distance \(\infty\), except for neighbors, which are connected by their edge weights.
  • (3.) Construct an embedding of the data in \(d\)-dimensional Euclidean space that best preserves the inter-point distances \(d_{\mathcal{G}}(i,j)\). This is performed via Multi-Dimensional Scaling (MDS).

ISOMAP comes with recovery guarantees: the discovered structure converges to the actual structure of the manifold as the sampling density of the points increases. MDS and ISOMAP both converge, but ISOMAP gets there more quickly.
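A compact MATLAB sketch of the three steps on a toy swiss roll (data and parameters are my own illustrative choices; step 3 uses classical MDS via double centering and assumes the graph is connected):

\begin{lstlisting}[language=MATLAB,caption=ISOMAP sketch]
% ISOMAP sketch: kNN graph -> all-pairs shortest paths -> classical MDS.
rng(0);
N = 400;
t = 3*pi/2 * (1 + 2*rand(N,1));  h = 20*rand(N,1);
X = [t.*cos(t), h, t.*sin(t)];             % swiss roll in 3-D

% Step 1: kNN graph weighted by Euclidean distance
k = 8;
sq = sum(X.^2, 2);
D = sqrt(max(0, repmat(sq, 1, N) + repmat(sq', N, 1) - 2 * (X * X')));
G = inf(N);
[~, idx] = sort(D, 2);
for i = 1:N
    G(i, idx(i, 2:k+1)) = D(i, idx(i, 2:k+1));
end
G = min(G, G');                            % symmetrize
G(1:N+1:end) = 0;

% Step 2: geodesic distances via Floyd-Warshall (O(N^3); fine for small N)
for m = 1:N
    G = min(G, repmat(G(:,m), 1, N) + repmat(G(m,:), N, 1));
end

% Step 3: classical MDS on the geodesic distances (assumes G is finite)
J = eye(N) - ones(N)/N;                    % centering matrix
B = -0.5 * J * (G.^2) * J;                 % double-centered squared distances
[V, L] = eig((B + B')/2);
[lambda, order] = sort(diag(L), 'descend');
V = V(:, order);
Y = V(:, 1:2) * diag(sqrt(max(lambda(1:2), 0)));   % 2-D embedding

scatter(Y(:,1), Y(:,2), 10, t, 'filled');  % color by the roll parameter t
\end{lstlisting}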

Locally Linear Embedding (LLE)

LLE is a method that learns linear weights that locally reconstruct points in order to map points to a lower dimension [1,4]. The method requires solving two successive optimization problems and requires a connectivity graph. Almost all methods start with the nearest-neighbor graph, since it is the only thing we can trust.

In the first step of LLE, we find weights that reconstruct each data point from its neighbors:

\[\begin{array}{ll} \underset{w}{\mbox{minimize }} & \| x_i - \sum\limits_{j \in N(i)} w_{ij}x_j \|^2 \\ \mbox{subject to} & \sum\limits_j w_{ij} = 1 \end{array}\]

We can use linear least squares with Lagrange multipliers to obtain the optimal weights, as sketched below.
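For a single point, this constrained least-squares problem has a closed-form solution: writing the local Gram matrix \(G_{jk} = (x_i - x_j)^T(x_i - x_k)\) over the neighbors, the Lagrange condition gives \(w \propto G^{-1}\mathbf{1}\), renormalized to sum to one. A hedged MATLAB sketch (my own helper function, with a small regularizer for the case of more neighbors than input dimensions):

\begin{lstlisting}[language=MATLAB,caption=LLE reconstruction weights for one point]
% Weights that reconstruct x from the columns of Z (its k neighbors).
% Save as lle_weights.m
function w = lle_weights(x, Z)
    k = size(Z, 2);
    Zc = Z - repmat(x, 1, k);              % shift neighbors so x sits at the origin
    G = Zc' * Zc;                          % local Gram matrix (k x k)
    G = G + 1e-6 * trace(G) * eye(k);      % regularize in case G is singular (k > d)
    w = G \ ones(k, 1);                    % Lagrange condition: G w proportional to 1
    w = w / sum(w);                        % enforce sum(w) = 1
end
\end{lstlisting}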

In the second step of LLE, we fix these weights \(w_{ij}\) and optimize over the low-dimensional coordinates \(x_i^{\prime}\):

\[\begin{array}{ll} \underset{x_1^{\prime}, \dots, x_n^{\prime}}{\mbox{minimize }} \sum\limits_i \| x_i^{\prime} - \sum\limits_{j \in N(i)} w_{ij}x_j^{\prime} \|^2 \end{array}\]

This is a sparse eigenvalue problem that requires constraints in order to prevent degenerate solutions:

  • (1.) The coordinates \(x_i^{\prime}\) can be translated by a constant displacement without affecting the cost. We can remove this degree of freedom by requiring the coordinates to be centered on the origin.
  • (2.) We constrain the embedding vectors to have unit covariance. The optimization problem becomes:
\[\begin{array}{ll} \underset{x_1^{\prime}, \dots, x_n^{\prime}}{\mbox{minimize }} & \sum\limits_i \| x_i^{\prime} - \sum\limits_{j \in N(i)} w_{ij}x_j^{\prime} \|^2 \\ \mbox{subject to} & \sum\limits_i x_i^{\prime} = 0 \\ & \frac{1}{n} \sum\limits_i x_i^{\prime}x_i^{\prime T} = I \end{array}\]

These weights \(w_{ij}\) capture the local shape. As Roweis and Saul point out, “LLE illustrates a general principle of manifold learning…that overlapping local neighborhoods – collectively analyzed – can provide information about global geometry” [4].
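In practice, this second-step optimization reduces to an eigenvalue problem: collecting the weights in an \(n \times n\) matrix \(W\), the cost is the quadratic form \(\mathrm{tr}\big(Y^T(I-W)^T(I-W)Y\big)\), and the embedding is given by the bottom eigenvectors of \((I-W)^T(I-W)\) after discarding the constant one. A hedged MATLAB sketch of this standard recipe (my own helper function):

\begin{lstlisting}[language=MATLAB,caption=LLE embedding step]
% Embed into p dimensions given the n x n LLE weight matrix W (rows sum to 1).
% Save as lle_embed.m
function Y = lle_embed(W, p)
    n = size(W, 1);
    M = (eye(n) - W)' * (eye(n) - W);      % cost = sum_i ||y_i - sum_j w_ij y_j||^2
    [V, L] = eig((M + M') / 2);            % M is symmetric; symmetrize numerically
    [~, order] = sort(diag(L), 'ascend');
    V = V(:, order);
    Y = sqrt(n) * V(:, 2:p+1);             % drop the constant eigenvector (eigenvalue ~ 0);
                                           % scale so that (1/n) * Y' * Y = I
end
\end{lstlisting}

Each row of \(Y\) gives the low-dimensional coordinates \(x_i^{\prime}\) of one data point.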

Stochastic Neighbor Embedding (SNE)

Stochastic Neighbor Embedding (SNE) converts high-dimensional points to low-dimensional points while preserving neighborhood structure. The method takes a probabilistic point of view: high-dimensional Euclidean distances between points are converted into conditional probabilities that represent similarities [2,3].

Similarity of data points in the high dimension: the conditional probability is given by

\[p_{j \mid i} = \frac{\mbox{exp }\big( - \frac{\|x_i - x_j\|^2}{2 \sigma_i^2}\big) }{ \sum\limits_{k \neq i} \mbox{exp }\big( - \frac{\|x_i - x_k\|^2}{2 \sigma_i^2} \big)}\]

Similarity of datapoints in the low dimension:

\[q_{j \mid i} = \frac{\mbox{exp }\big( - \|y_i - y_j\|^2\big) }{ \sum\limits_{k \neq i} \mbox{exp }\big( - \|y_i - y_k\|^2 \big)}\]

If similarities between \(x_i,x_j\) are correctly mapped to similarities between \(y_i,y_j\) by SNE, then the conditional probabilities should be equal: \(q_{j \mid i} = p_{j \mid i}\).

SNE seeks to minimize, via gradient descent, the following cost function, which measures the dissimilarity between the two distributions using the Kullback-Leibler divergence:

\[C = \sum\limits_i KL(P_i || Q_i) = \sum\limits_i \sum\limits_j p_{j \mid i} \mbox{ log } \frac{p_{j \mid i}}{q_{j \mid i}}\]

This is known as asymmetric SNE. The gradient turns out to be analytically simple:

\[\frac{\partial C}{\partial y_i} = 2 \sum\limits_j (p_{j \mid i} - q_{j \mid i} + p_{i \mid j} - q_{i \mid j} )(y_i - y_j)\]

However, the Kullback-Leibler divergence is not symmetric, so a formulation with a joint distribution can be made.
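Before moving on, here is a minimal MATLAB sketch of asymmetric SNE (toy data of my own; a single shared bandwidth \(\sigma\) replaces the usual per-point perplexity calibration, and only one gradient step is shown):

\begin{lstlisting}[language=MATLAB,caption=Asymmetric SNE sketch]
% Conditional similarities p(j|i), q(j|i) and one gradient-descent step on C.
rng(0);
n = 100; D = 10; d = 2; sigma = 1.0; eta = 0.1;
X = randn(n, D);                           % high-dimensional points (rows)
Y = 0.01 * randn(n, d);                    % initial low-dimensional points

% High-dimensional conditional probabilities p(j|i); row i sums to 1
sqx = sum(X.^2, 2);
Dx = repmat(sqx, 1, n) + repmat(sqx', n, 1) - 2 * (X * X');
P = exp(-Dx / (2 * sigma^2));  P(1:n+1:end) = 0;
P = P ./ repmat(sum(P, 2), 1, n);

% Low-dimensional conditional probabilities q(j|i)
sqy = sum(Y.^2, 2);
Dy = repmat(sqy, 1, n) + repmat(sqy', n, 1) - 2 * (Y * Y');
Q = exp(-Dy);  Q(1:n+1:end) = 0;
Q = Q ./ repmat(sum(Q, 2), 1, n);

% Gradient of the KL cost and one update of the embedding
grad = zeros(n, d);
for i = 1:n
    coeff = P(i,:) - Q(i,:) + P(:,i)' - Q(:,i)';      % p(j|i)-q(j|i)+p(i|j)-q(i|j)
    grad(i,:) = 2 * coeff * (repmat(Y(i,:), n, 1) - Y);
end
Y = Y - eta * grad;
\end{lstlisting}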

t-SNE

An improvement to SNE is t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE employs a Gaussian kernel in the high-dimensional space, but a Student-t distribution in the low-dimensional space. The Student-t distribution has heavier tails than a Gaussian and is thus more tolerant of points being placed far apart. The motivation is that in low dimensions you have less freedom than in high dimensions to keep many things close by (the space gets crowded easily), so we penalize far-apart points less.

The joint distribution in the low-dimensional space is:

\[q_{ij} = \frac{ (1+ \| y_i - y_j \|^2)^{-1} }{ \sum\limits_{k \neq l} (1+ \|y_k - y_l\|^2)^{-1} }\]
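A few lines of MATLAB compute this joint distribution for a given embedding (the coordinates below are toy values for illustration):

\begin{lstlisting}[language=MATLAB,caption=t-SNE low-dimensional similarities]
% Joint similarities q_ij with the Student-t (one degree of freedom) kernel.
rng(0);  n = 100;  Y = randn(n, 2);        % example embedding coordinates
sqy = sum(Y.^2, 2);
Dy = repmat(sqy, 1, n) + repmat(sqy', n, 1) - 2 * (Y * Y');
num = 1 ./ (1 + Dy);                       % (1 + ||y_i - y_j||^2)^(-1)
num(1:n+1:end) = 0;                        % exclude i = j
Q = num / sum(num(:));                     % normalize over all pairs k ~= l
\end{lstlisting}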

MNIST examples

References

[1] Leonidas Guibas. Multi-Dimensional Scaling, Non-Linear Dimensionality Reduction. Class lectures of CS233: Geometric and Topological Data Analysis, taught at Stanford University, 18 April 2018.

[2] Geoffrey Hinton and Sam Roweis. Stochastic Neighbor Embedding. Advances in Neural Information Processing Systems (NIPS) 2003, pages 857–864. http://papers.nips.cc/paper/2276-stochastic-neighbor-embedding.pdf.

[3] L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9 (Nov):2579-2605, 2008.

[4] Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science Magazine, Vol. 290, 22 Dec. 2000.

[5] J. B. Tenenbaum, V. de Silva and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290 (5500): 2319-2323, 22 December 2000. PDF.

[6] Laurenz Wiskott. Principal Component Analysis. 11 March 2004. Online PDF.

[7] Karl Pearson. On Lines and Planes of Closest Fit to Systems of Points in Space. 1901. Philosophical Magazine. 2 (11): 559–572.

[8] H Hotelling. Analysis of a complex of statistical variables into principal components. 1933. Journal of Educational Psychology, 24, 417–441, and 498–520. Hotelling, H (1936). “Relations between two sets of variates”. Biometrika. 28 (3/4): 321–377. doi:10.2307/2333955. JSTOR 2333955.

[9] Matthew A. Turk and Alex P Pentland. “Face Recognition Using Eigenfaces.” CVPR 1991. PDF