Fast Nearest Neighbors

Table of Contents:

The Nearest Neighbor Problem
Nearest Neighbor Computation
Brute Force Nearest Neighbors
Brute Force Compiled w/ Numba JIT
Vectorization: Speedup via Numpy Broadcasting
Vectorization: Quadratic Expansion Derivation
Implementation: Fast Affinity Matrix Computation via Quadratic Expansion
Speed Comparison

The Nearest Neighbor Problem

Finding closest points in a high-dimensional space is a re-occurring problem in computer vision, especially when performing feature matching (e.g. with SIFT [Lowe, 2004]) or computing Chamfer distances [Fan et al.], [Kurenkov et al.] for point set generation with deep networks. It is a cheaper way of finding point-to-point correspondences than optimal bipartite matching, as the Earth Mover’s Distance requires.

Brute force methods can be prohibitively slow and much faster ways exist of computing with a bit of linear algebra.

Nearest Neighbor Computation

Let \(\mathcal{A,B}\) be sets. We are interested in the finding the nearest neighbor for each point in \(\mathcal{A}\). Let \(a,b \in \mathbb{R}^n\) be two points such that \(a \in \mathcal{A}\), \(b \in \mathcal{B}\). The nearest neighbor in \(\mathcal{B}\) of a point \(a \in \mathcal{A}\) is a point \(b \in \mathcal{B}\), such that \(b = \mbox{arg} \underset{b \in \mathcal{B}}{\mbox{ min}} \|a-b\|_2\). We can equivalently use the squared Euclidean distance \(\|a-b\|_2^2\), since the square function is monotonically increasing for positive values, and distances are always positive. We will see that using the squared Euclidean distance to find the nearest neighbor will spare us some computation later.

The expression \(\|a-b\|_2^2\) is equivalent to an inner product. It is equivalent to the Mahalanobis distance, \((a-b)^TA(a-b)\), when \(A=I\), when working in Euclidean space (computing \(\ell_2\) norms).

Brute Force Nearest Neighbors

In the brute force regime, we would loop through all points \(a_i \in \mathcal{A}\), and then loop through all points \(b_j \in \mathcal{B}\), and find the distance \(\|a_i - b_j\|\) with np.linalg.norm(A[i] - B[j]). This can be done with a double for-loop in Python:

def naive_upper_triangular_compute_affinity_matrix(
    A: np.ndarray,
    B: np.ndarray
) -> np.ndarray:
    """Compute squared Euclidean distances between two point sets `A` and `B`.

    Args:
        A: Array of shape (m1, n).
        B: Array of shape (m2, n).

    Returns:
        Array of shape (m1,m2) representing squared Euclidean distances (not necessarily symmetric).
    """
    # Ensure feature vectors have the same length.
    if A.shape[1] != B.shape[1]:
        raise ValueError("Point sets A and B must have points of identical dimension.")

    m1, n = A.shape
    m2, n = B.shape

    affinity_mat = np.zeros((m1,m2))
    for i in range(m1): # rows
        for j in range(m2): # cols
            diff = A[i] - B[j]
            norm = np.linalg.norm(diff)
            affinity_mat[i,j] = norm

    # Affinity matrix contains the Mahalanobis distances.
    return np.square(affinity_mat)

However, this method will be brutally slow for thousands, tens of thousands, or millions of points, which are quite common point cloud sizes in computer vision or robotics. We need a better way.

Brute Force Speedup with Numba JIT

A trivial way to speed up the double-for loops is to compile the code with Numba JIT.

from numba import jit

@jit(nopython=True)
def naive_upper_triangular_compute_affinity_matrix(
    A: np.ndarray,
    B: np.ndarray
) -> np.ndarray:
   ...

This will give us some savings, but we can do much better.

Vectorization: Speedup via Numpy Broadcasting

Suppose we reshape \(B\) from B \(\in \mathbb{R}^{m_2, n}\) to B \(\in \mathbb{R}^{m_2, 1, n}\). Then, if we use A \(\in \mathbb{R}^{m_1, n}\) in Numpy, \(A - B\) would have shape \((m_2, m_1, n)\). Taking the norm along axis=2 would yield \((m_2, m_1)\), which can be reshaped to \((m_1, m_2)\):

def compute_affinity_matrix_broadcasting(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute squared Euclidean distances between two point sets `A` and `B`.

    Args:
        A: Array of shape (m1, n).
        B: Array of shape (m2, n).

    Returns:
        Array of shape (m1,m2) representing squared Euclidean distances (not necessarily symmetric).
    """
    _, n = A.shape
    m2, _ = B.shape
    dists = np.linalg.norm(A - B.reshape(m2,1,n), axis=2)
    return np.square(dists.T)

It turns out this is much faster than brute force, but there is another approach that can be yet faster.

Vectorization: Quadratic Expansion Derivation

Now we wish to compute these distances on all pairs of points in entire datasets simultaneously. We can form matrices \(A \in R^{m_1 \times n}, B \in R^{m_2 \times n}\) that hold our high-dimensional points.

Let \(a,b\) be vectors, i.e. \(a,b \in \mathbb{R}^n\) are vectors:

\[\begin{aligned} \|a-b\|_2^2 &= (a-b)^T(a-b) \\ &= (a^T-b^T)(a-b) \\ &= a^Ta -b^Ta - a^Tb + b^Tb \end{aligned}\]

Since \(-b^Ta\) and \(- a^Tb\) are scalars (inner products), we can swap the order or arguments to find:

\[\begin{aligned} \|a-b\|_2^2 &= a^Ta -a^Tb - a^Tb + b^Tb \\ &= a^Ta -2 a^Tb + b^Tb \end{aligned}\]

We will see that nearest neighbor computation for all points boils down to only 3 required matrix products: \(AA^T, BB^T, AB^T\).

Our goal is to find \(\|a_i - b_j\|_2^2 = a_i^Ta_i -2 a_i^Tb_j + b_j^Tb_j\) for all \(i,j\). We wish to build an affinity matrix \(D\) such that the \(D_{ij}\) entry contains the squared distance between \(a_i, b_j\).

Consider a sets \(\mathcal{A,B}\) with 3 points each. We form \(AA^T, BB^T\):

\[AA^T = \begin{bmatrix} - & a_1 & - \\ - & a_2 & - \\ - & a_3 & - \end{bmatrix} \begin{bmatrix} | & | & | \\ a_1^T & a_2^T & a_3^T \\ | & | & | \end{bmatrix} = \begin{bmatrix} a_1^Ta_1 & a_1^T a_2 & a_1^T a_3 \\ a_2^Ta_1 & a_2^T a_2 & a_2^T a_3 \\ a_3^Ta_1 & a_3^T a_2 & a_3^T a_3 \end{bmatrix}\] \[BB^T = \begin{bmatrix} - & b_1 & - \\ - & b_2 & - \\ - & b_3 & - \end{bmatrix} \begin{bmatrix} | & | & | \\ b_1^T & b_2^T & b_3^T \\ | & | & | \end{bmatrix} = \begin{bmatrix} b_1^T b_1 & b_1^T b_2 & b_1^T b_3 \\ b_2^T b_1 & b_2^T b_2 & b_2^T b_3 \\ b_3^T b_1 & b_3^T b_2 & b_3^T b_3 \end{bmatrix}\]

We are interested only in the diagonal elements \(a_i^Ta_i\) and \(b_i^Tb_i\). We will define \(T_A\) and \(T_B\) to contain tiled rows, where each row is a diagonal of \(AA^T\) or \(BB^T\), respectively:

\[\begin{array}{ll} T_A = \begin{bmatrix} a_1^Ta_1 & a_2^Ta_2 & a_3^Ta_3 \\ a_1^Ta_1 & a_2^Ta_2 & a_3^Ta_3 \\ a_1^Ta_1 & a_2^Ta_2 & a_3^Ta_3 \end{bmatrix}, & T_B = \begin{bmatrix} b_1^T b_1 & b_2^T b_2 & b_3^T b_3 \\ b_1^T b_1 & b_2^T b_2 & b_3^T b_3 \\ b_1^T b_1 & b_2^T b_2 & b_3^T b_3 \end{bmatrix} \end{array}\]

We now form \(AB^T\):

\[AB^T = \begin{bmatrix} - & a_1 & - \\ - & a_2 & - \\ - & a_3 & - \end{bmatrix} \begin{bmatrix} | & | & | \\ b_1^T & b_2^T & b_3^T \\ | & | & | \end{bmatrix} = \begin{bmatrix} b_1^Ta_1 & b_2^Ta_1 & b_3^Ta_1 \\ b_1^Ta_2 & b_2^Ta_2 & b_3^Ta_2 \\ b_1^Ta_3 & b_2^Ta_3 & b_3^Ta_3 \end{bmatrix}\]

Our desired affinity matrix \(D \in \mathbf{R}^{3 \times 3}\) will contain entries \(D_{ij} = \|a_i - b_j\|_2^2\):

\[D = \begin{bmatrix} \| a_1 - b_1 \|_2^2 & \| a_1 - b_2 \|_2^2 & \|a_1 - b_3 \|_2^2 \\ \| a_2 - b_1 \|_2^2 & \| a_2 - b_2 \|_2^2 & \|a_2 - b_3 \|_2^2 \\ \| a_3 - b_1 \|_2^2 & \| a_3 - b_2 \|_2^2 & \|a_3 - b_3 \|_2^2 \end{bmatrix}\]

In turns out that:

\[\begin{aligned} D &= T_A^T + T_B - 2 AB^T \\ D &= \begin{bmatrix} a_1^Ta_1 & a_1^Ta_1 & a_1^Ta_1 \\ a_2^Ta_2 & a_2^Ta_2 & a_2^Ta_2 \\ a_3^Ta_3 & a_3^Ta_3 & a_3^Ta_3 \end{bmatrix} + \begin{bmatrix} b_1^T b_1 & b_2^T b_2 & b_3^T b_3 \\ b_1^T b_1 & b_2^T b_2 & b_3^T b_3 \\ b_1^T b_1 & b_2^T b_2 & b_3^T b_3 \end{bmatrix} - 2 \begin{bmatrix} b_1^Ta_1 & b_2^Ta_1 & b_3^Ta_1 \\ b_1^Ta_2 & b_2^Ta_2 & b_3^Ta_2 \\ b_1^Ta_3 & b_2^Ta_3 & b_3^Ta_3 \end{bmatrix} \end{aligned}\]

Since as you can see above, \(D_{ij} = \|a_i - b_j\|_2^2 = a_i^Ta_i -2 a_i^Tb_j + b_j^Tb_j\) for all \(i,j\).

Implementation: Fast Affinity Matrix Computation via Quadratic Expansion

The implementation requires just a few lines in Numpy:

import numpy as np

def fast_affinity_mat_compute(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Compute squared Euclidean distances between two point sets `A` and `B`.

    Args:
        A: Array of shape (m1, n).
        B: Array of shape (m2, n).

    Returns:
        Array of shape (m1,m2) representing squared Euclidean distances (not necessarily symmetric).
    """
    # Ensure feature vectors have the same length.
    if A.shape[1] != B.shape[1]:
        raise ValueError("Point sets A and B must have points of identical dimension.")

    m1, n = A.shape
    m2, _ = B.shape

    ab_T = np.matmul(A, B.T )
    a_sqr = np.diag(A.dot(A.T))
    b_sqr = np.diag(B.dot(B.T))
    a_sqr = np.tile( a_sqr, (m2,1) ).T
    b_sqr = np.tile( b_sqr, (m1,1) )
    return a_sqr + b_sqr - 2 * ab_T

Speed Comparison

We now demonstrate the runtime of the three approaches approach. We’ll use the following snippet:

import time


def unit_test_arr_arr(m1: int, m2: int, n: int) -> None:
    """ """
    A = np.random.rand(m1,n) # Point set 1.
    B = np.random.rand(m2,n) # Point set 2.

    start = time.time()
    gt_aff = naive_upper_triangular_compute_affinity_matrix(A, B)
    end = time.time()
    duration = end - start
    print(f"Duration: {duration:.3f}")

    start = time.time()
    pred_aff2 = compute_affinity_matrix_broadcasting(A, B)
    end = time.time()
    duration = end - start
    print(f"Duration: {duration:.3f}")

    start = time.time()
    pred_aff = fast_affinity_mat_compute(A, B)
    end = time.time()
    duration = end - start
    print(f"Duration: {duration:.3f}")

    print("Sum of abs differences: ", np.absolute(gt_aff - pred_aff).sum())
    assert np.allclose(gt_aff, pred_aff, atol=1e-2)

    print("Sum of abs differences: ", np.absolute(gt_aff - pred_aff2).sum())
    assert np.allclose(gt_aff, pred_aff2, atol=1e-2)


if __name__ == '__main__':

    m1 = 1000
    m2 = 1350
    n = 128

    np.random.seed(2)
    unit_test_arr_arr(m1, m2, n)

Now consider the speedup we’ve achieved when we measure runtimes:

\(m_1, m_2, n\)	Brute Force	Brute Force w/ JIT	Broadcasting	Vectorized Quad. Expansion
10k, 1350, 3	60.2 s.	6.6 s.	1.1 s.	1.3 s.
1000, 1350, 128	7.2 s.	2.5	2.0 s.	0.07 s.
5000, 5000, 128	101.6 s.	17.5	OOM	1.3 s.

Using a GPU

We’ll now implement both vectorized methods for GPU using Pytorch, and run another speed comparison.

Well-discussed topic in the Pytorch forums.

import torch

def pairwise_distances(
    A: torch.Tensor,
    B: Optional[torch.Tensor] = None
) -> torch.Tensor:
    """Compute squared Euclidean distances between two point sets `A` and `B`.

    Args:
        A: Array of shape (m1, n).
        B: Array of shape (m2, n). If none, self-distances w/ `A` are computed.

    Returns:
        Array of shape (m1,m2) representing squared Euclidean distances (not necessarily symmetric).
    """
    A_norm = (A**2).sum(1).view(-1, 1)
    if B is not None:
        B_t = torch.transpose(B, 0, 1)
        B_norm = (B**2).sum(1).view(1, -1)
    else:
        # Self-distances.
        B_t = torch.transpose(A, 0, 1)
        B_norm = A_norm.view(1, -1)
    
    dist = A_norm + B_norm - 2.0 * torch.mm(A, B_t)
    # Ensure diagonal is zero if A=B
    # if B is None:
    #     dist = dist - torch.diag(dist.diag)
    return torch.clamp(dist, 0.0, np.inf)

See open-UCN

References

David Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004. [PDF].
H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. CVPR 2017. [PDF].
A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. B. Choy, and S. Savarese. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. WACV 2018. [PDF].