Table of Contents:

  • Epipolar Geometry
  • A Basic Structure-from-Motion (SFM) Pipeline
  • Cost Functions/Error Modelling
  • Bundle Adjustment
  • Gauss-Newton
  • Exploiting Sparsity
  • Bundler
  • COLMAP
  • Rotation Averaging
  • Schur Complement
  • Rodrigues' Vector
  • Triangulation
  • References

Epipolar Geometry

Luong et al. (1993) analyzed methods for estimating the fundamental matrix, including non-linear least squares [4, 5].

A Basic Structure-from-Motion (SFM) Pipeline

As described in [1], there are generally four steps to the algorithm (a short code sketch of steps 1-3 follows the list):

  • (1) Find interest points in each image
  • (2) Find candidate correspondences (match descriptors for each interest point)
  • (3) Perform geometric verification of correspondences (RANSAC + fundamental matrix)
  • (4) Solve for the 3D points and cameras that minimize reprojection error.
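To make steps (1)-(3) concrete, here is a minimal sketch using OpenCV; the library choice, the filenames, and the ratio-test threshold are assumptions for illustration, not part of the pipeline description in [1].

    # Sketch of pipeline steps (1)-(3) with OpenCV; image filenames are hypothetical.
    import cv2
    import numpy as np

    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

    # (1) Find interest points (and descriptors) in each image.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # (2) Find candidate correspondences by matching descriptors (Lowe's ratio test).
    matches = []
    for pair in cv2.BFMatcher().knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.8 * pair[1].distance:
            matches.append(pair[0])

    # (3) Geometric verification: RANSAC while estimating the fundamental matrix.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)

Step (4) is the subject of the rest of this post.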

Cost Functions/Error Modelling

The cost function quantifies the total prediction error of the model: it measures how well the model fits the observations and our background knowledge.

Reprojection Error for a Single Keypoint in Two Images

Imagine we have matched two keypoints, \(x_1,x_2\) in two different images \(I_1,I_2\) via a SIFT-like feature matching pipeline. We are viewing the projection of the same 3D point \(\mathbf{X}\) in both images. We now wish to identify the coordinates describing the location of \(\mathbf{X}\).

If image \(I_1\) was captured with projection matrix \(M_1\), and image \(I_2\) was captured with projection matrix \(M_2\), we can enforce that the 3D point \(\mathbf{X}\) projects to the right location in both images. This is called triangulation.

If the camera calibrations determining \(M_1, M_2\) are known, then the optimization problem for a single matched keypoint in two images becomes:

\[\min_{\mathbf{X}} f(\mathbf{X}) = \| x_1 - Proj(\mathbf{X},M_1)\|^2 + \| x_2 - Proj(\mathbf{X},M_2)\|^2\]
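A minimal NumPy sketch of this cost (the names are illustrative, and \(Proj\) is assumed to dehomogenize \(M [\mathbf{X}; 1]\)):

    import numpy as np

    def proj(X, M):
        """Project a 3D point X (shape (3,)) with a 3x4 projection matrix M."""
        x_h = M @ np.append(X, 1.0)   # homogeneous image coordinates
        return x_h[:2] / x_h[2]       # dehomogenize to pixel coordinates

    def two_view_cost(X, x1, x2, M1, M2):
        """f(X): squared reprojection error of one matched keypoint in two images."""
        return np.sum((x1 - proj(X, M1)) ** 2) + np.sum((x2 - proj(X, M2)) ** 2)

In practice one would minimize this starting from a linear (DLT) initialization; see the Triangulation section below.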

Reprojection Error for Many Keypoints in Many Cameras

Imagine now that we have \(m\) different cameras, each with a known projection matrix \(\{M_i\}_{i=1}^m\). Suppose we match \(n\) different keypoints across the images, corresponding to 3D points \(\{\mathbf{X}_j\}_{j=1}^n\).

The optimization problem becomes:

\[\min_{\mathbf{X}_1,\mathbf{X}_2, \dots} \sum\limits_{i=1}^m \sum\limits_{j=1}^n \| x_{ij} - Proj(\mathbf{X_j},M_i)\|^2\]

In this case \(x_{ij}\) is the observed location of the \(j\)'th keypoint in the \(i\)'th image, as discovered by a keypoint detector in the SIFT-like pipeline. We penalize solutions for \(\mathbf{X}_1,\mathbf{X}_2,\dots\) in which \(x_{ij}\) is not close to \(Proj(\mathbf{X}_j,M_i)\).
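As a sketch, the same cost could be evaluated as follows (reusing proj() from the previous snippet; an observations dictionary handles the fact that not every point is seen in every image):

    def total_reprojection_error(points_3d, cameras, observations):
        """Sum of squared reprojection errors over all observed (camera i, point j) pairs.

        points_3d:    (n, 3) array of 3D points X_j.
        cameras:      length-m list of known 3x4 projection matrices M_i.
        observations: dict mapping (i, j) -> observed 2D location x_ij.
        """
        return sum(
            np.sum((np.asarray(x_ij) - proj(points_3d[j], cameras[i])) ** 2)
            for (i, j), x_ij in observations.items()
        )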

Bundle Adjustment

Imagine now that we are working with arbitrary images for which we have no calibration information. Thus, the projection matrices \(\{M_i\}_{i=1}^m\) are unknown.

Bundle adjustment is the process of minimizing reprojection error over (1) multiple 3D points and (2) multiple cameras. Triggs et al. define it as “the problem of refining a visual reconstruction to produce jointly optimal structure and viewing parameter estimates” [2]. The optimization problem changes only in that the projection matrices become additional variables we solve for.

\[\min_{\mathbf{X}_1,\mathbf{X}_2, \dots, M_1,M_2,\dots} \sum\limits_{i=1}^m \sum\limits_{j=1}^n \| x_{ij} - Proj(\mathbf{X_j},M_i)\|^2\]

According to [2], the name “Bundle Adjustment” refers to bundles of light rays leaving each 3D point and converging on each camera center, “which are ‘adjusted’ optimally with respect to both feature and camera positions”.
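A hedged sketch of the bundle adjustment residuals, assuming known shared intrinsics \(K\) and a minimal 6-parameter camera (Rodrigues rotation vector plus translation); the packing scheme and names are illustrative, and proj() is reused from above:

    import numpy as np
    from scipy.spatial.transform import Rotation

    def ba_residuals(params, n_cams, n_pts, K, observations):
        """Stacked residuals x_ij - Proj(X_j, M_i), with cameras and points in one vector.

        params: the first 6*n_cams entries hold per-camera [rotation_vector, translation],
                the remaining 3*n_pts entries hold the 3D points.
        """
        cam_params = params[:6 * n_cams].reshape(n_cams, 6)
        points_3d = params[6 * n_cams:].reshape(n_pts, 3)
        residuals = []
        for (i, j), x_ij in observations.items():
            R = Rotation.from_rotvec(cam_params[i, :3]).as_matrix()
            t = cam_params[i, 3:]
            M_i = K @ np.hstack([R, t[:, None]])   # 3x4 projection matrix for camera i
            residuals.append(np.asarray(x_ij) - proj(points_3d[j], M_i))
        return np.concatenate(residuals)

Feeding this to scipy.optimize.least_squares would then run a trust-region (Levenberg-Marquardt style) solve; the next sections discuss why first-order methods and sparsity matter here.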

Gauss-Newton

Gauss-Newton is generally preferred to second-order methods like Newton's method for a simple reason: deriving and implementing the second derivatives of the projection model \(Proj(\mathbf{X}_j, M_i)\) is difficult and error-prone [2].
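Concretely, if \(r(\theta)\) stacks all of the residuals \(x_{ij} - Proj(\mathbf{X}_j, M_i)\) and \(J = \partial r / \partial \theta\) is its Jacobian, each Gauss-Newton step solves the normal equations

\[(J^T J)\, \delta = -J^T r\]

which approximates the Hessian of the cost by \(J^T J\), so only first derivatives of \(Proj\) are needed. Levenberg-Marquardt damps the step by solving \((J^T J + \lambda I)\, \delta = -J^T r\) instead.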

For example, the seminal SfM work “Building Rome in a Day” [3] uses trust-region Gauss-Newton optimization (Levenberg-Marquardt), choosing between a truncated-step and an exact-step variant of the algorithm.

Exploiting Sparsity

Using a generic, dense optimization routine to solve SfM problems would be unwise: the optimization will be slow, because the problem's considerable sparsity goes unexploited.
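As one illustration (a sketch, not how any of the systems cited here are implemented): each residual depends on exactly one camera and one point, and with scipy.optimize.least_squares this structure can be passed via jac_sparsity so that Jacobian estimation and the trust-region subproblems stay sparse. This reuses the parameter packing of the ba_residuals sketch above:

    from scipy.optimize import least_squares
    from scipy.sparse import lil_matrix

    def ba_sparsity(n_cams, n_pts, observations):
        """0/1 Jacobian pattern: residual block (i, j) touches only camera i and point j."""
        n_params = 6 * n_cams + 3 * n_pts
        S = lil_matrix((2 * len(observations), n_params), dtype=int)
        for row, (i, j) in enumerate(observations.keys()):
            S[2 * row:2 * row + 2, 6 * i:6 * i + 6] = 1                            # camera block
            S[2 * row:2 * row + 2, 6 * n_cams + 3 * j:6 * n_cams + 3 * j + 3] = 1  # point block
        return S

    # result = least_squares(ba_residuals, x0, method="trf",
    #                        jac_sparsity=ba_sparsity(n_cams, n_pts, observations),
    #                        args=(n_cams, n_pts, K, observations))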

Bundler

Bundler is the incremental SfM system of Snavely, Seitz, and Szeliski, described in “Modeling the world from internet photo collections” [6].

COLMAP

Rotation Averaging

R. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation Averaging. IJCV, 103(3):267–305, 2013

http://users.cecs.anu.edu.au/~hongdong/rotationaveraging.pdf

\[2 \cos(|\alpha|) = \mathrm{trace}\left(R_{gt}^{-1} R_{est}\right) - 1\]

i.e., we measure the minimum rotation angle required to align both rotations.
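A small sketch of this metric (using the identity \(\mathrm{trace}(R) = 1 + 2\cos\theta\) for a rotation by angle \(\theta\); the clipping guards against round-off):

    import numpy as np

    def rotation_angle_error(R_gt, R_est):
        """Minimum rotation angle (radians) needed to align R_est with R_gt."""
        R_rel = R_gt.T @ R_est                      # relative rotation; R_gt^{-1} = R_gt^T
        cos_alpha = 0.5 * (np.trace(R_rel) - 1.0)   # 2 cos|alpha| = trace(R_rel) - 1
        return np.arccos(np.clip(cos_alpha, -1.0, 1.0))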

Some formulations use an MRF scheme; rotation averaging can be slow.

Schur Complement

The Hessian matrix \(H\) in bundle adjustment has a distinctive structure: the camera and point diagonal blocks \(H_{cc}\) and \(H_{pp}\) are sparse (block-diagonal), whereas the camera-point block \(H_{cp}\) is dense. Exploiting this structure is key to fast, robust visual SLAM: state-of-the-art monocular systems such as ORB-SLAM and DSO solve the normal equations by decomposing them with the Schur complement rather than solving them directly (as noted in the VITAMIN-E paper).
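Concretely, writing the normal equations in block form, with camera updates \(\delta_c\) and point updates \(\delta_p\), the Schur complement eliminates the point block first. Because \(H_{pp}\) is block-diagonal, its inverse is cheap, and what remains is a much smaller reduced camera system:

\[\begin{aligned} \begin{bmatrix} H_{cc} & H_{cp} \\ H_{cp}^T & H_{pp} \end{bmatrix} \begin{bmatrix} \delta_c \\ \delta_p \end{bmatrix} &= \begin{bmatrix} b_c \\ b_p \end{bmatrix} \\ \left(H_{cc} - H_{cp} H_{pp}^{-1} H_{cp}^T\right) \delta_c &= b_c - H_{cp} H_{pp}^{-1} b_p \\ \delta_p &= H_{pp}^{-1}\left(b_p - H_{cp}^T \delta_c\right) \end{aligned}\]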

Rodrigues’ Vector

Triangulation

Imagine a 3D point \(\mathbf{X}\) seen in two different views, or perhaps in \(n\) different views. We can triangulate its 3D position if we know the relative pose of the cameras, the camera intrinsics, and the 2D measurement locations. The problem reduces to computing the approximate intersection of rays.

A linear variant using the Direct Linear Transform (DLT) ships in GTSAM and in Kornia.

Hartley & Zisserman (page 312) provide a useful description of the DLT Triangulation method, which I’ll summarize below. We’ll form a homogeneous set of equations. Let \(\mathbf{X} = \begin{bmatrix} X & Y & Z & 1 \end{bmatrix}^T\). Let \(\mathbf{x} = \begin{bmatrix} x & y & 1 \end{bmatrix}^T\) represent a 2d measured point. We can write a projection equation for each view/measurement:

\[\begin{aligned} \mathbf{x} &= P \mathbf{X} \\ \mathbf{x}^\prime &= P^\prime \mathbf{X} \\ \mathbf{x}^{\prime\prime} &= P^{\prime\prime} \mathbf{X} \\ & \vdots \end{aligned}\]

We can use a cross product to get three equations for each measurement (2D image point):

\[\begin{aligned} \mathbf{x} \times \mathbf{x} &= \mathbf{x} \times (P \mathbf{X}) \\ \begin{bmatrix} 0 & -1 & y \\ 1 & 0 & -x \\ -y & x & 0 \end{bmatrix} \mathbf{x} &= \mathbf{x} \times P \mathbf{X} \\ \begin{bmatrix} 0 & -1 & y \\ 1 & 0 & -x \\ -y & x & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} &= \mathbf{x} \times P \mathbf{X} \\ 0 &= \mathbf{x} \times P \mathbf{X} \\ 0 &= \begin{bmatrix} 0 & -1 & y \\ 1 & 0 & -x \\ -y & x & 0 \end{bmatrix} \begin{bmatrix} P_{11} & P_{12} & P_{13} & P_{14} \\ P_{21} & P_{22} & P_{23} & P_{24} \\ P_{31} & P_{32} & P_{33} & P_{34} \\ \end{bmatrix} \mathbf{X} \end{aligned}\]

You can see above that a linear combination of the rows of \(P\) is being formed. Following Hartley and Zisserman, let \(\mathbf{p}_i^T\) represent the \(i\)'th row of \(P\).

\[\begin{aligned} 0 &= \begin{bmatrix} 0 & -1 & y \\ 1 & 0 & -x \\ -y & x & 0 \end{bmatrix} \begin{bmatrix} - & \mathbf{p}_1^T & - \\ - & \mathbf{p}_2^T & - \\ - & \mathbf{p}_3^T & - \end{bmatrix} \mathbf{X} \end{aligned}\] \[\begin{aligned} y(\mathbf{p}_3^T\mathbf{X}) - (\mathbf{p}_2^T\mathbf{X}) &= 0 \\ \mathbf{p}_1^T\mathbf{X} - x (\mathbf{p}_3^T\mathbf{X}) &= 0 \\ x(\mathbf{p}_2^T\mathbf{X}) - y(\mathbf{p}_1^T\mathbf{X}) &= 0 \end{aligned}\]

These give three equations for each image point, of which only two are linearly independent: the third is a linear combination of the first two (up to sign, \(x\) times the first plus \(y\) times the second). See [9].

Since we can multiply both sides of any equation by -1, we will often see the second constraint written as

\[\begin{aligned} \mathbf{p}_1^T\mathbf{X} - x (\mathbf{p}_3^T\mathbf{X}) &= 0 \\ -\mathbf{p}_1^T\mathbf{X} + x (\mathbf{p}_3^T\mathbf{X}) &= 0 \\ x (\mathbf{p}_3^T\mathbf{X}) - \mathbf{p}_1^T\mathbf{X} &= 0 \end{aligned}\]

We end up with a tall but skinny data matrix \(A\) for a homogeneous system of equations:

\[A \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \mathbf{0}\]

For 2 views, \(A\) could be expressed as:

\[\begin{aligned} A \mathbf{X} = \begin{bmatrix} x (\mathbf{p}_3^T) - \mathbf{p}_1^T \\ y (\mathbf{p}_3^T) - \mathbf{p}_2^T \\ x^\prime (\mathbf{p}_3^{\prime T}) - \mathbf{p}_1^{\prime T} \\ y^\prime (\mathbf{p}_3^{\prime T}) - \mathbf{p}_2^{\prime T} \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \mathbf{0} \end{aligned}\]

The code is then simple: since each 2D-to-3D point correspondence gives you 2 equations, a tall \(A\) matrix of shape \((2m, 4)\) is formed from \(m\) measurements. In GTSAM, the code follows the math exactly:

    // build system of equations
    A.row(row) = p.x() * projection.row(2) - projection.row(0);
    A.row(row + 1) = p.y() * projection.row(2) - projection.row(1);
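A minimal NumPy version of the same construction (the names are illustrative; the homogeneous system is solved via SVD, taking the right singular vector with the smallest singular value):

    import numpy as np

    def triangulate_dlt(projections, measurements):
        """DLT triangulation of one 3D point.

        projections:  list of m 3x4 projection matrices.
        measurements: list of m (x, y) pixel observations of the same point.
        """
        A = np.zeros((2 * len(projections), 4))
        for row, (P, (x, y)) in enumerate(zip(projections, measurements)):
            A[2 * row]     = x * P[2] - P[0]   # x * p3^T - p1^T
            A[2 * row + 1] = y * P[2] - P[1]   # y * p3^T - p2^T
        _, _, Vt = np.linalg.svd(A)
        X_h = Vt[-1]                           # null-space direction (smallest singular value)
        return X_h[:3] / X_h[3]                # dehomogenize to (X, Y, Z)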

Nonlinear Triangulation: Geometric error cost function

See Section 12.3 of Hartley & Zisserman.
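One way to minimize the geometric (reprojection) error is to refine the DLT estimate numerically; a sketch, reusing proj() and triangulate_dlt() from above (Hartley & Zisserman also derive an optimal two-view solution that avoids iteration):

    from scipy.optimize import least_squares

    def triangulate_nonlinear(projections, measurements):
        """Refine the linear (DLT) estimate by minimizing reprojection error over X."""
        X0 = triangulate_dlt(projections, measurements)

        def residuals(X):
            return np.concatenate(
                [np.asarray(x) - proj(X, P) for P, x in zip(projections, measurements)]
            )

        return least_squares(residuals, X0).x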

References

[1] Ramanan, Deva. Structure from Motion. http://16720.courses.cs.cmu.edu/lec/sfm.pdf.

[2] Bill Triggs, Philip McLauchlan, Richard Hartley and Andrew Fitzgibbon. Bundle Adjustment — A Modern Synthesis. https://hal.inria.fr/inria-00548290/document.

[3] Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, Richard Szeliski. Building Rome in a Day. Communications of the ACM, Volume 54, Issue 10, October 2011. Pages 105-112. https://grail.cs.washington.edu/rome/rome_paper.pdf.

[4] Luong, Q.-T. and Faugeras, O.: 1993, Determining the fundamental matrix with planes: instability and new algorithms, Proc. IEEE Conf. Comp. Vision Patt. Recog., New York City, NY, pp. 489–494.

[5] Luong, Q.-T., Deriche, R., Faugeras, O. and Papadopoulo, T.: 1993, On determining the fundamental matrix: analysis of different methods and experimental results, Technical Report 1894, INRIA Sophia-Antipolis. PDF.

[6] N. Snavely, S. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, 80(2):189–210, 2008.

[7] Christopher Zach. Robust Bundle Adjustment Revisited. PDF.

[8] Frank Dellaert. Notes PDF. Slides PDF

[9] Kris Kitani. http://www.cs.cmu.edu/~16385/s17/Slides/11.4_Triangulation.pdf