Table of Contents:

Raycasting

Unlike raytracing, raycasting seeks to find only the first intersection between a ray and a surface (if it exists), rather than continuing to trace the path. It turns out the math for this is quite simple to do naively, but the interesting part is figuring out how to make the implementation very cheap and fast. Scratchapixel has some very clear notes that I’ll draw on below. We’ll walk through the derivation from first principles, and show how to implement a slow version in Python, and a fast version in CUDA.

System of Equations for Ray/Triangle Intersection

Suppose we model the surface as a triangle mesh. In order to find this intersection point \(P\), we need a parametric equation of a ray, and of a triangle. Suppose we define the ray by its origin \(O\) and its direction \(R\). The ray parametric equation is \(P=O+tR\) where \(t\) is the distance from the ray origin \(O\) to point \(P\). To find \(P\), we need to compute distance \(t\).

The three vertices \((v_0,v_1,v_2)\) of a triangle define a plane, and we can compute its normal vector with a single cross product as follows:

def compute_triangle_plane_normal(v0: np.ndarray, v1: np.ndarray, v2: np.ndarray) -> np.ndarray:
    """Compute normal to plane spanned by triangle's 3 vertices.
       v2
      /  \
     /    \
    v0 --- v1

    Args:
        v0: (3,) array representing triangle vertex 0, ordered counter-clockwise (CCW).
        v1: (3,) array representing triangle vertex 1.
        v2: (3,) array representing triangle vertex 2.

    Returns:
        Unit length plane normal vector of shape (3,).
    """
    v0v1 = v1 - v0
    v0v2 = v2 - v0
    N = np.cross(v0v1, v0v2)
    N /= np.linalg.norm(N)
    return N

Now, we can form a system of equations. Suppose we have \(P=(x,y,z)\). We know ray origin \(O\) and direction \(R\), along with the plane normal \(N=(A,B,C)\) and distance \(D\). Our only unknowns are \(P=(x,y,z)\) and \(t\):

\[\begin{aligned} P = O + tR & \\ Ax + By + Cz + D = 0 & \\ A * P_x + B * P_y + C * P_z + D = 0 & \rightarrow \mbox{ rewrite } P=(x,y,z) \mbox{ as } P=(P_x,P_y,P_z) \\ \end{aligned}\]

We’ll now substitute \(P\) (from the first equation) to \((x, y, z)\) in equation 2 and solve for distance \(t\). Let \(O=(O_x, O_y, O_z)\) and let \(R=(R_x, R_y, R_z)\):

\[\begin{aligned} A * (O_x + tR_x) + B * (O_y + tR_y) + C * (O_z + tR_z) + D = 0\\ A * O_x + B * O_y + C * O_z + A * tR_x + B * tR_y + C * tR_z + D = 0\\ t * (A * R_x + B * R_y + C * R_z) + A * O_x + B * O_y + C * O_z + D = 0\\ t * (A * R_x + B * R_y + C * R_z) = -\Big( A * O_x + B * O_y + C * O_z + D \Big)\\ t = -{\dfrac{A * O_x + B * O_y + C * O_z + D}{A * R_x + B * R_y + C * R_z}}\\ t = -{\dfrac{ N(A,B,C) \cdot O + D}{N(A,B,C) \cdot R}} \end{aligned}\]

The math simplifies nicely, and this part is quite simple to implement:

N = compute_triangle_plane_normal(v0, v1, v2)

NdotRayDirection = N.dot(ray_dir)

# compute d parameter of implicit line equation
d = N.dot(v0)

# compute t
t = (d - N.dot(origin)) / NdotRayDirection

This shows how to find where the ray intersects a 3d plane. There are at least 3 more conditions we need to check for, though:

  1. Is the intersection point inside of the triangle’s perimeter, or outside?
  2. Are the ray direction and the 3d plane parallel to one another?
  3. Is the triangle “behind” the ray?

For condition (1), we’ll need another sub-routine, which we’ll call the “inside-outside test”.

We can determine condition (2) easily by ensuring that:

NdotRayDirection = N.dot(ray_dir)
kEpsilon = 1e-10
is_parallel = np.absolute(NdotRayDirection) < kEpsilon:  # almost 0

Condition (3) is also easy to verify: after solving for \(t\), we check the sign of \(t\) and ensure that it’s non-negative.

Inside-Outside Test

def inside_outside_test(
    N: np.ndarray, v0: np.ndarray, v1: np.ndarray, v2: np.ndarray, P: np.ndarray
) -> bool:
    """Determine whether a point on a 3d plane falls within a triangle's boundary.

       v2
      /  \
     /    \
    v0 --- v1

    C is a vector perpendicular to the triangle's plane.

    Args:
        N: (3,) array representing plane normal.
        v0: (3,) array representing triangle vertex 0, ordered counter-clockwise (CCW).
        v1: (3,) array representing triangle vertex 1.
        v2: (3,) array representing triangle vertex 2.
        P: (3,) array representing a point lying on the triangle's plane.

    Returns:
        Boolean indicating whether plane point falls within triangle.
    """
    # edge 0
    edge0 = v1 - v0
    vp0 = P - v0
    C = np.cross(edge0, vp0)
    if N.dot(C) < 0:
        return False  # P is on the right side

    # edge 1
    edge1 = v2 - v1
    vp1 = P - v1
    C = np.cross(edge1, vp1)
    if N.dot(C) < 0:
        return False  # P is on the right side

    # edge 2
    edge2 = v0 - v2
    vp2 = P - v2
    C = np.cross(edge2, vp2)
    if N.dot(C) < 0:
        return False  # P is on the right side

    return True

Now that we’ve shown how to do this for a single point, suppose we want to vectorize it, to operate over N points at once:

def inside_outside_test_vectorized(
    N: np.ndarray, v0: np.ndarray, v1: np.ndarray, v2: np.ndarray, Ps: np.ndarray
) -> np.array:
    """Determine whether multiple points on a 3d plane fall within a triangle's boundary.

    Args:
        N: (3,) array representing plane normal.
        v0: (3,) array representing triangle vertex 0, ordered counter-clockwise (CCW).
        v1: (3,) array representing triangle vertex 1.
        v2: (3,) array representing triangle vertex 2.
        Ps: (N,3) array representing points lying on the triangle's plane.
            Not necessarily within the triangle. These came from ray-plane intersection.

    Returns:
        Boolean array of shape (N,) indicating which plane points falls within the triangle.
    """
    # edge 0
    edge0 = v1 - v0
    # edge 1
    edge1 = v2 - v1
    # edge 2
    edge2 = v0 - v2

    vp0 = Ps - v0
    vp1 = Ps - v1
    vp2 = Ps - v2

    C0 = np.cross(edge0, vp0)
    C1 = np.cross(edge1, vp1)
    C2 = np.cross(edge2, vp2)

    # if (dot product < 0), then P is on the right side
    return np.logical_and.reduce([C0.dot(N) >= 0, C1.dot(N) >= 0, C2.dot(N) >= 0])

Using these subroutines, we can put together a first naive algorithm for ray-triangle intersection computation:

def ray_triangle_intersect(
    origin: np.ndarray, ray_dir: np.ndarray, v0: np.ndarray, v1: np.ndarray, v2: np.ndarray
) -> Tuple[bool, Optional[np.ndarray]]:
    """Compute ray-triangle intersection, if such a intersection is possible.

    Args:
        origin: Array of shape (3,) representing ray origin.
        ray_dir: ray direction shape (3,)
        v0: (3,) array representing triangle vertex 0, ordered counter-clockwise (CCW).
        v1: (3,) array representing triangle vertex 1.
        v2: (3,) array representing triangle vertex 2.

    Returns:
        Boolean indicating whether valid intersection occurred.
        P: array of shape (3,) representing intersection point if exists, otherwise None.
    """
    N = compute_triangle_plane_normal(v0, v1, v2)

    # Step 1: finding P
    # check if ray and plane are parallel ?
    NdotRayDirection = N.dot(ray_dir)
    kEpsilon = 1e-10
    if np.absolute(NdotRayDirection) < kEpsilon:  # almost 0
        return False, None  # they are parallel so they don't intersect !

    # compute d parameter of implicit line equation
    d = N.dot(v0)

    # compute t, the distance along the ray from the origin.
    t = (d - N.dot(origin)) / NdotRayDirection
    # check if the triangle is in behind the ray
    if t < 0:
        return False, None  # the triangle is behind

    # compute the intersection point using ray parameterization
    P = origin + t * ray_dir

    is_inside = inside_outside_test(N, v0, v1, v2, P)
    if not is_inside:
        return False, None

    return True, P  # this ray hits the triangle

Cramer’s rule and Barycentric coordinates

Scratchapixel has an excellent set of notes on this topic.

Moller-Trombore

A clever optimization is…

Embree

def ray_triangle_intersect_moller_trombore(
    origin: np.ndarray, ray_dir: np.ndarray, v0: np.ndarray, v1: np.ndarray, v2: np.ndarray
) -> Tuple[bool, Optional[np.ndarray]]:
    """Compute ray-triangle intersection using the Moller-Trombore algorithm, if such a intersection is possible.
    
    t is the distance along the ray from the origin

    Args:
        origin: shape (3,)
        ray_dir: ray direction shape (3,)
        v0: triangle vertex 0, ordered counter-clockwise (CCW)
        v1: triangle vertex 1
        v2: triangle vertex 2

    Returns:
        boolean indicating whether intersection is valid
        P: intersection point if exists, otherwise None
    """
    v0v1 = v1 - v0
    v0v2 = v2 - v0
    pvec = np.cross(ray_dir, v0v2)
    det = v0v1.dot(pvec)

    # CULLING
    # if the determinant is negative the triangle is backfacing
    # if the determinant is close to 0, the ray misses the triangle
    kEpsilon = 1e-10
    if det < kEpsilon:
        return False, None

    invDet = 1 / det

    tvec = origin - v0
    u = tvec.dot(pvec) * invDet
    if (u < 0) or (u > 1):
        return False, None

    qvec = np.cross(tvec, v0v1)
    v = ray_dir.dot(qvec) * invDet
    if (v < 0) or (u + v > 1):
        return False, None

    t = v0v2.dot(qvec) * invDet

    P = origin + t * ray_dir
    return True, P


def ray_triangle_intersect_vectorized_moller_trombore(
    origin: np.ndarray, ray_dirs: np.ndarray, v0: np.ndarray, v1: np.ndarray, v2: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """Use Cramer's rule and Barycentric coordinates, per
    https://www.scratchapixel.com/lessons/3d-basic-rendering/ray-tracing-rendering-a-triangle/barycentric-coordinates

    Args:
        ray_dirs: N x 3 for directions of N ray

    Returns:
        valid: array of bool, whether hit or not
        Ps: array of intersection points, otherwise NULL values
    """
    v0v1 = v1 - v0
    v0v2 = v2 - v0
    pvec = np.cross(ray_dirs, v0v2)
    det = pvec.dot(v0v1)

    # CULLING
    # if the determinant is negative the triangle is backfacing
    # if the determinant is close to 0, the ray misses the triangle
    kEpsilon = 1e-10
    # valid = det >= kEpsilon

    invDet = 1 / det

    tvec = origin - v0
    u = pvec.dot(tvec) * invDet

    # valid = np.logical_and.reduce(
    # 	[
    # 		u >= 0,
    # 		u <= 1,
    # 		valid
    # 	])

    qvec = np.cross(tvec, v0v1)
    v = ray_dirs.dot(qvec) * invDet
    valid = np.logical_and.reduce(
        [
            det >= kEpsilon,
            u >= 0,
            u <= 1,
            v >= 0,
            u + v <= 1
            # valid
        ]
    )
    t = v0v2.dot(qvec) * invDet

    # compute the intersection point using ray parameterization
    # see broadcast example below (so last dims match for multiply)
    Ps = origin + (t * ray_dirs.T).T
    return valid, Ps


def ray_triangle_intersect_vectorized(
    origin: np.ndarray, ray_dirs: np.ndarray, v0: np.ndarray, v1: np.ndarray, v2: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
    """

    Args:
        ray_dirs: N x 3 for directions of N ray

    Returns:
        valid: array of bool, whether hit or not
        Ps: array of intersection points, otherwise NULL values
    """
    N = compute_triangle_plane_normal(v0, v1, v2)

    # Step 1: finding P
    # check if ray and plane are parallel ?
    NdotRayDirections = ray_dirs.dot(N)
    kEpsilon = 1e-10

    # if almost 0, they are parallel so they don't intersect !
    valid = np.absolute(NdotRayDirections) > kEpsilon

    # compute d parameter of implicit line equation
    d = N.dot(v0)

    DENOMINATOR_PADDING = 100
    NdotRayDirections[~valid] += DENOMINATOR_PADDING

    # compute t -- a vector of distances along the ray
    t = (d - N.dot(origin)) / NdotRayDirections

    # compute the intersection point using ray parameterization
    # see broadcast example below (so last dims match for multiply)
    Ps = origin + (t * ray_dirs.T).T

    is_inside = inside_outside_test_vectorized(N, v0, v1, v2, Ps)

    # check if the triangle is in behind the ray
    # if t < 0, then # the triangle is behind
    valid = np.logical_and.reduce([valid, t >= 0, is_inside])

    # if true, then # this ray hits the triangle
    return valid, Ps



import time

import numpy as np

import tbv.rendering.ray_triangle_intersection as raytracing_cpu


def test_compute_triangle_plane_normal() -> None:
    """ """
    # triangle in the 2d xy-plane
    v0 = np.array([0.0, 0, 0])
    v1 = np.array([3.0, 0, 0])
    v2 = np.array([1.0, 2, 0])
    N = raytracing_cpu.compute_triangle_plane_normal(v0, v1, v2)
    assert np.allclose(N, np.array([0.0, 0.0, 1.0]))

    # 3d example, diagonal slice of a cube
    v0 = np.array([0.0, 2.0, 0])
    v1 = np.array([2.0, 0, 0])
    v2 = np.array([2.0, 2, 2])
    N = raytracing_cpu.compute_triangle_plane_normal(v0, v1, v2)
    # points toward the origin (-x,-y), but upwards (+z)
    assert np.allclose(N, np.array([-1 / np.sqrt(3), -1 / np.sqrt(3), 1 / np.sqrt(3)]))


def test_inside_outside_test_vectorized() -> None:
    """ """
    # example entirely in xy-plane for now
    # in CCW order, triangle lives in x-axis plane
    N = np.array([0, 0, 1])
    v0 = np.array([0.0, 0, 0])
    v1 = np.array([3.0, 0, 0])
    v2 = np.array([1.0, 2, 0])

    assert np.allclose(N, raytracing_cpu.compute_triangle_plane_normal(v0, v1, v2))

    Ps = np.array(
        [
            [2, -2, 0],  # P is outside the triangle, but inside the xy-plane
            [1, 1, 0],  # now P is inside the triangle, and inside the xy-plane
        ]
    )

    is_inside = raytracing_cpu.inside_outside_test_vectorized(N, v0, v1, v2, Ps)
    gt_is_inside = np.array([False, True])
    assert np.allclose(is_inside, gt_is_inside)

    # get random points
    Ps = np.random.randn(100, 3)
    # place on x-y plane
    Ps[:, 2] = 0

    is_inside = raytracing_cpu.inside_outside_test_vectorized(N, v0, v1, v2, Ps)
    gt_is_inside = np.array([raytracing_cpu.inside_outside_test(N, v0, v1, v2, P) for P in Ps])
    assert np.allclose(is_inside, gt_is_inside)


def test_inside_outside_test() -> None:
    """ """
    # example entirely in xy-plane for now
    # in CCW order, triangle lives in x-axis plane
    N = np.array([0, 0, 1])
    v0 = np.array([0.0, 0, 0])
    v1 = np.array([3.0, 0, 0])
    v2 = np.array([1.0, 2, 0])
    assert np.allclose(N, compute_triangle_plane_normal(v0, v1, v2))

    # P is outside the triangle, but inside the xy-plane
    P = np.array([2, -2, 0])
    is_inside = inside_outside_test(N, v0, v1, v2, P)
    assert is_inside == False

    # now P is inside the triangle, and inside the xy-plane
    P = np.array([1, 1, 0])
    is_inside = inside_outside_test(N, v0, v1, v2, P)
    assert is_inside == True



def test_broadcast_multiply() -> None:
    """ """
    t = np.array([1, 2, 3, 4])
    ray_dirs = np.array([[2, 1, 0], [4, 2, 0], [6, 3, 0], [8, 6, 0]])
    product = (t * ray_dirs.T).T
    gt_product = np.array([[2, 1, 0], [8, 4, 0], [18, 9, 0], [32, 24, 0]])
    assert np.allclose(product, gt_product)


def test_ray_triangle_intersect_vectorized1() -> None:
    """ """

    for intersect_fn in [raytracing_cpu.ray_triangle_intersect_vectorized_moller_trombore, raytracing_cpu.ray_triangle_intersect_vectorized]:
        origin = np.array([0.0, 0.0, 0.0])

        v0 = np.array([2.0, 1, 0])
        v1 = np.array([2.0, -1, 0])
        v2 = np.array([2.0, 0, 2])

        ray_dirs = np.array(
            [
                [1.0, 0.0, 0.0],  # simple case when it hits correctly at base of triangle
                [1.0, 0.0, 1.0],  # case when hits top of triangle
                [1.0, 0, 4.0],  # case when ray goes high too quickly
                [-1.0, 0, 0.0],  # # case when ray pointing in wrong direction
            ]
        )

        inter_exists, inter_points = intersect_fn(origin, ray_dirs, v0, v1, v2)

        gt_inter_exists = np.array([True, True, False, False])
        assert np.allclose(inter_exists, gt_inter_exists)

        gt_inter_points = np.array(
            [[2.0, 0.0, 0.0], [2.0, 0.0, 2.0], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]]
        )

        assert np.allclose(inter_points[:2], gt_inter_points[:2])


def test_ray_triangle_intersect_vectorized2() -> None:
    """Stress test against non-vectorized version"""

    for vec_intersect_fn in [ray_triangle_intersect_vectorized_moller_trombore, ray_triangle_intersect_vectorized]:

        origin1 = np.array([1.63, 0, 1.42])
        # Big triangle
        #              x, y, z
        v0 = np.array([1, 10, 0]).astype(np.float32)
        v1 = np.array([1, -10, 0]).astype(np.float32)
        v2 = np.array([100, 0, 0]).astype(np.float32)
        tri1 = [v0, v1, v2]

        origin2 = np.array([0.0, 0.0, 0.0])

        v0 = np.array([2.0, 1, 0])
        v1 = np.array([2.0, -1, 0])
        v2 = np.array([2.0, 0, 2])
        tri2 = [v0, v1, v2]

        scenario1 = [origin1, tri1]
        scenario2 = [origin2, tri2]

        num_rays = 10000

        for (origin, tri) in [scenario1, scenario2]:
            v0, v1, v2 = tri
            ray_dirs = np.random.randn(num_rays, 3)
            inter_exists_arr, inter_points_arr = vec_intersect_fn(origin, ray_dirs, v0, v1, v2)

            for i in range(num_rays):
                ray_dir = ray_dirs[i]
                for intersect_fn in [ray_triangle_intersect_moller_trombore, ray_triangle_intersect]:

                    inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)
                    assert inter_exists_arr[i] == inter_exists
                    if inter_exists:
                        assert np.allclose(inter_points_arr[i], P)


def test_ray_triangle_intersect() -> None:
    """ 
       v2
      /  \
     /    \
    v0 --- v1
    all 3 vertices sit in x=2 plane
    """
    for intersect_fn in [raytracing_cpu.ray_triangle_intersect_moller_trombore, raytracing_cpu.ray_triangle_intersect]:
        # simple case when it hits correctly at base of triangle
        v0 = np.array([2.0, 1, 0])
        v1 = np.array([2.0, -1, 0])
        v2 = np.array([2.0, 0, 2])

        origin = np.array([0.0, 0.0, 0.0])

        ray_dir = np.array([1.0, 0.0, 0.0])
        inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

        assert inter_exists == True
        gt_P = np.array([2.0, 0.0, 0.0])
        assert np.allclose(P, gt_P)

        # case when hits top of triangle
        ray_dir = np.array([1.0, 0.0, 1.0])
        inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

        assert inter_exists == True
        gt_P = np.array([2.0, 0.0, 2.0])
        assert np.allclose(P, gt_P)

        # case when ray goes high too quickly
        ray_dir = np.array([1.0, 0, 4.0])
        inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

        assert inter_exists == False
        assert P is None

        # case when ray pointing in wrong direction
        ray_dir = np.array([-1.0, 0, 0.0])
        inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

        assert inter_exists == False
        assert P is None


def test_ray_triangle_intersect_looking_down() -> None:
    """ """
    for intersect_fn in [raytracing_cpu.ray_triangle_intersect_moller_trombore, raytracing_cpu.ray_triangle_intersect]:

        origin = np.array([1.63, 0, 1.42])
        ray_dir = np.array([0, 0, -1])

        #              x, y, z
        v0 = np.array([1, 10, 0]).astype(np.float32)
        v1 = np.array([1, -10, 0]).astype(np.float32)
        v2 = np.array([100, 0, 0]).astype(np.float32)

        inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)
        assert inter_exists
        assert np.allclose(P, np.array([1.63, 0, 0]))


def test_ray_triangle_intersect() -> None:
    """ raytracing moller_trombore returns [0,0,0] as None placeholder
	   v2
	  /  \
	 /    \
	v0 --- v1
	all 3 vertices sit in x=2 plane
	"""
    intersect_fn = raytracing_cpu.ray_triangle_intersect_moller_trombore

    # simple case when it hits correctly at base of triangle
    v0 = np.array([2.0, 1, 0])
    v1 = np.array([2.0, -1, 0])
    v2 = np.array([2.0, 0, 2])

    origin = np.array([0.0, 0.0, 0.0])

    ray_dir = np.array([1.0, 0.0, 0.0])
    inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

    assert inter_exists == True
    gt_P = np.array([2.0, 0.0, 0.0])
    assert np.allclose(P, gt_P)

    # case when hits top of triangle
    ray_dir = np.array([1.0, 0.0, 1.0])
    inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

    assert inter_exists == True
    gt_P = np.array([2.0, 0.0, 2.0])
    assert np.allclose(P, gt_P)

    # case when ray goes high too quickly
    ray_dir = np.array([1.0, 0, 4.0])
    inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

    assert inter_exists == False
    assert np.allclose(P, np.zeros(3))

    # case when ray pointing in wrong direction
    ray_dir = np.array([-1.0, 0, 0.0])
    inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)

    assert inter_exists == False
    assert np.allclose(P, np.zeros(3))


def test_ray_triangle_intersect_looking_down() -> None:
    """ """
    intersect_fn = raytracing_cpu.ray_triangle_intersect_moller_trombore

    origin = np.array([1.63, 0, 1.42])
    ray_dir = np.array([0, 0, -1])

    #              x, y, z
    v0 = np.array([1, 10, 0]).astype(np.float32)
    v1 = np.array([1, -10, 0]).astype(np.float32)
    v2 = np.array([100, 0, 0]).astype(np.float32)

    inter_exists, P = intersect_fn(origin, ray_dir, v0, v1, v2)
    assert inter_exists
    assert np.allclose(P, np.array([1.63, 0, 0]), atol=1e-7)


def test_intersect_rays_with_tri_mesh_stress_test() -> None:
    """
    2000 x 1000 px image, w/ 200 triangles, 37.88 on macbook, but 24 sec without extra mallocs in loop
    2 sec if bail early upon first hit
    """
    num_rays = 1500 * 2048
    # num_rays = 1000

    triangles_matrix = get_flat_plane_grid_triangles(range_m=10)
    num_triangles = triangles_matrix.shape[0]
    print(f"Num triangles: {num_triangles}")

    origin = np.ones(3)
    ray_dirs = np.random.randn(num_rays, 3)
    # smokescreen and correctness
    start = time.time()
    inter_exists_arr, inter_points_arr = tbv_raytracing.intersect_rays_with_tri_mesh(triangles_matrix, origin, ray_dirs)
    end = time.time()
    duration = end - start
    print(f"Took {duration} sec.")

    assert inter_exists_arr.shape[0] == num_rays
    assert inter_points_arr.shape[0] == num_rays

    for j in range(num_rays):
        ray_hit_this_tri = np.zeros(num_triangles, dtype=bool)
        where_ray_hit_this_tri = np.zeros((num_triangles, 3))

        for i in range(num_triangles):
            ray_dir = ray_dirs[j]
            v0 = triangles_matrix[i][:3]
            v1 = triangles_matrix[i][3:6]
            v2 = triangles_matrix[i][6:]
            inter_exists, P = raytracing_cpu.ray_triangle_intersect_moller_trombore(origin, ray_dir, v0, v1, v2)
            ray_hit_this_tri[i] = inter_exists
            where_ray_hit_this_tri[i] = P

        assert inter_exists_arr[j] == ray_hit_this_tri.sum()
        if ray_hit_this_tri.sum() == 1:
            i = int(np.where(ray_hit_this_tri)[0])
            assert np.allclose(where_ray_hit_this_tri[i], inter_points_arr[j])

    # assert per_triangle_inter_exists_arr[i][j] == inter_exists
    # 		assert np.allclose(per_triangle_inter_points[i][j], P)


def get_flat_plane_grid_triangles(range_m: float = 30) -> np.ndarray:
    """
    v2 - v3
    |  \\ |
    v0 - v1
    """
    nearby_triangles = []
    # loop over x
    for x in range(-range_m, range_m):
        # loop over y
        for y in range(-range_m, range_m):

            v0 = np.array([x, y, 0]).astype(np.float)
            v1 = np.array([x + 1, y, 0]).astype(np.float)
            v2 = np.array([x, y + 1, 0]).astype(np.float)
            v3 = np.array([x + 1, y + 1, 0]).astype(np.float)

            tri_lower = np.hstack([v0, v1, v2])
            tri_upper = np.hstack([v2, v1, v3])

            nearby_triangles.append(tri_lower)
            nearby_triangles.append(tri_upper)

    return np.array(nearby_triangles)


if __name__ == "__main__":
    """ """
    # test_ray_triangle_intersect()
    # test_ray_triangle_intersect_looking_down()
    test_intersect_rays_with_tri_mesh_stress_test()

    # test_ray_triangle_intersect_vectorized1()
    # quit()
    # test_ray_triangle_intersect_looking_down()
    # test_ray_triangle_intersect()

    # test_compute_triangle_plane_normal()
    # test_inside_outside_test()

    # test_broadcast_multiply()
    # test_inside_outside_test_vectorized()

    test_ray_triangle_intersect_vectorized2()
// Copyright (c) 2021
// Argo AI, LLC, All Rights Reserved.
// 
// Notice: All information contained herein is, and remains the property
// of Argo AI. The intellectual and technical concepts contained herein
// are proprietary to Argo AI, LLC and may be covered by U.S. and Foreign
// Patents, patents in process, and are protected by trade secret or
// copyright law. This work is licensed under a CC BY-NC-SA 4.0 
// International License.
// 
// Originating Authors: John Lambert


#include <sstream>
#include <iostream>
#include <cuda_runtime.h>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <pybind11/stl.h>
#include <cuda_runtime.h>
#include <limits> // for std::numeric limits
#include <stdio.h>


// #include <pybind11/eigen.h>


/** add 
* @{
*/
__device__ float3 operator+(const float3& a, const float3& b)
{
  return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
}
__device__ float3 operator+(const float3& a, const float b)
{
  return make_float3(a.x + b, a.y + b, a.z + b);
}
__device__ float3 operator+(const float a, const float3& b)
{
  return make_float3(a + b.x, a + b.y, a + b.z);
}
__device__ void operator+=(float3& a, const float3& b)
{
  a.x += b.x; a.y += b.y; a.z += b.z;
}
/** @} */

/** subtract 
* @{
*/
__device__ float3 operator-(const float3& a, const float3& b)
{
  return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
}
__device__ float3 operator-(const float3& a, const float b)
{
  return make_float3(a.x - b, a.y - b, a.z - b);
}
__device__ float3 operator-(const float a, const float3& b)
{
  return make_float3(a - b.x, a - b.y, a - b.z);
}
__device__ void operator-=(float3& a, const float3& b)
{
  a.x -= b.x; a.y -= b.y; a.z -= b.z;
}
/** @} */

/** multiply 
* @{
*/
__device__ float3 operator*(const float3& a, const float3& b)
{
  return make_float3(a.x * b.x, a.y * b.y, a.z * b.z);
}
__device__ float3 operator*(const float3& a, const float s)
{
  return make_float3(a.x * s, a.y * s, a.z * s);
}
__device__ float3 operator*(const float s, const float3& a)
{
  return make_float3(a.x * s, a.y * s, a.z * s);
}
__device__ void operator*=(float3& a, const float3& s)
{
  a.x *= s.x; a.y *= s.y; a.z *= s.z;
}
__device__ void operator*=(float3& a, const float s)
{
  a.x *= s; a.y *= s; a.z *= s;
}
/** @} */

/** divide 
* @{
*/
__device__ float3 operator/(const float3& a, const float3& b)
{
  return make_float3(a.x / b.x, a.y / b.y, a.z / b.z);
}
__device__ float3 operator/(const float3& a, const float s)
{
  float inv = 1.0f / s;
  return a * inv;
}
__device__ float3 operator/(const float s, const float3& a)
{
  return make_float3( s/a.x, s/a.y, s/a.z );
}
__device__ void operator/=(float3& a, const float s)
{
  float inv = 1.0f / s;
  a *= inv;
}
/** @} */

/** dot product from optixu/optixu_math_namespace.h */
__device__ float dot(const float3& a, const float3& b)
{
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

/** cross product from optixu/optixu_math_namespace.h */
__device__ float3 cross(const float3& a, const float3& b)
{
  return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);
}

/** normalize from optixu/optixu_math_namespace.h */
__device__ float3 normalize(const float3& v)
{
  float invLen = 1.0f / sqrtf(dot(v, v));
  return v * invLen;
}



__device__ float3 ray_triangle_intersect_kernel(
  float3 origin,
  float3 ray_dir,
  float3 v0,
  float3 v1,
  float3 v2)
{
  float NO_HIT_VALUE = -99999;

  float3 v0v1 = v1 - v0;
  float3 v0v2 = v2 - v0;
  float3 pvec = cross(ray_dir,v0v2);
  double det = dot(v0v1,pvec);

  float kEpsilon = 1e-10;
  if (det < kEpsilon)
  {
    return make_float3(NO_HIT_VALUE,NO_HIT_VALUE,NO_HIT_VALUE);
  }

  double invDet = 1 / det;
  float3 tvec = origin - v0;
  
  double u = dot(tvec,pvec) * invDet;
  if ((u < 0) || (u > 1))
  {
    return make_float3(NO_HIT_VALUE,NO_HIT_VALUE,NO_HIT_VALUE);
  }

  float3 qvec = cross(tvec,v0v1);
  float v = dot(ray_dir,qvec) * invDet;
  if ((v < 0) || (u + v > 1))
  {
    return make_float3(NO_HIT_VALUE,NO_HIT_VALUE,NO_HIT_VALUE);
  }

  float t = dot(v0v2,qvec) * invDet;

  float3 P = origin + t * ray_dir;
  return P;
}




/*
moller_trombore algorithm
Will have num_rays global thread indices

Each thread handles 1 ray

Args
  ray_dirs: float array with 3 * num_rays elements
  hits: float array with 3 * num_rays * num_triangles elements
  triangles: float array with 9 * num_triangles elements

  int size_bytes = num_rays*N_COORDS_PER_RAY*sizeof(double);
  ray_dirs_gpu, 

  size_bytes = num_triangles*N_VERTS_PER_TRI*N_COORDS_PER_TRI_VERTEX*sizeof(double);
  triangles_gpu

  size_bytes = num_triangles*num_rays*3*sizeof(double);
  hits_gpu

*/
__global__ void ray_mesh_intersect_kernel(
  const float3 origin,
  const double* ray_dirs,
  const double* triangles,
  double* hits,
  const int num_rays,
  const int num_triangles)
{
  unsigned int ray_idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (ray_idx < (num_rays) )
  {
    float NO_HIT_VALUE = -99999;
    float kEpsilon = 1e-10;

    unsigned int ray_offs = 3 * ray_idx; // n'th element to take from the array
    unsigned int hit_offs = ray_offs;
    float3 ray_dir = make_float3(ray_dirs[ray_offs], ray_dirs[ray_offs+1], ray_dirs[ray_offs+2]);

    for (int tri_idx=0; tri_idx < num_triangles; tri_idx++)
    {
      unsigned int tri_offs = 9 * tri_idx; // n'th element to take from the array
  
      float3 v0 = make_float3(triangles[tri_offs  ], triangles[tri_offs+1], triangles[tri_offs+2]);
      float3 v1 = make_float3(triangles[tri_offs+3], triangles[tri_offs+4], triangles[tri_offs+5]);
      float3 v2 = make_float3(triangles[tri_offs+6], triangles[tri_offs+7], triangles[tri_offs+8]);

      float3 hit = ray_triangle_intersect_kernel(origin, ray_dir, v0, v1, v2);
      if (
           ( fabs(hit.x - NO_HIT_VALUE) > kEpsilon) 
        && ( fabs(hit.y - NO_HIT_VALUE) > kEpsilon) 
        && ( fabs(hit.z - NO_HIT_VALUE) > kEpsilon)
        )
      {
        hits[hit_offs] = hit.x;
        hits[hit_offs + 1] = hit.y;
        hits[hit_offs + 2] = hit.z;
        return;
      }
    }
    // this ray never hit any triangle
    hits[hit_offs] = NO_HIT_VALUE;
    hits[hit_offs + 1] = NO_HIT_VALUE;
    hits[hit_offs + 2] = NO_HIT_VALUE;

  }
}



// each ray tries to hit a triangle, then exits 
// otherwise, send out all triangle-ray pairs
void run_intersection_kernel(
  const float3 & origin,
  const double* ray_dirs, // on gpu
  const double* triangles, // on gpu
  double* hits, // on gpu
  int num_rays,
  int num_triangles)
{
  int n_thread = 512; // also try 256
  dim3 dimBlock(n_thread, 1, 1); // threads per block
  int num_blocks_reqd = ceil( float(num_rays) / float(dimBlock.x) );
  dim3 dimGrid(num_blocks_reqd); // number of blocks

  //std::cout << "Num rays: " << num_rays << std::endl;
  //std::cout << "Num triangles: " << num_triangles << std::endl;
  //std::cout << "n thread: " << n_thread << std::endl;
  //std::cout << "Num blocks required:" << num_blocks_reqd << std::endl;

  ray_mesh_intersect_kernel<<<dimGrid, dimBlock>>>
    (
      origin,
      ray_dirs,
      triangles,
      hits,
      num_rays,
      num_triangles
    );

  cudaError_t error = cudaGetLastError();
  if (error != cudaSuccess) {
    std::stringstream strstr;
    strstr << "run_kernel launch failed" << std::endl;
    strstr << "dimBlock: " << dimBlock.x << ", " << dimBlock.y << std::endl;
    strstr << "dimGrid: " << dimGrid.x << ", " << dimGrid.y << std::endl;
    strstr << cudaGetErrorString(error);
    throw strstr.str();
  }
}


/*
TODO: 8000 images over 1000 logs is 8M calls of this function
Make one class, and prevent 8M mallocs

Also try 1 thread per triangle-ray pair, and compare runtime.
*/

//std::tuple<Eigen::VectorXb, Eigen::MatrixXd>
void intersect_rays_with_tri_mesh(
  pybind11::array_t<double> triangles,
  pybind11::array_t<double> origin,
  pybind11::array_t<double> ray_dirs,
  pybind11::array_t<double> hits)
{
  pybind11::buffer_info tri_info = triangles.request();
  pybind11::buffer_info ray_info = ray_dirs.request();


  if (tri_info.ndim != 2) {
    std::stringstream strstr;
    strstr << "triangles.ndim != 2" << std::endl;
    strstr << "triangles.ndim: " << tri_info.ndim << std::endl;
    throw std::runtime_error(strstr.str());
  }

  if (tri_info.shape[1] != 9) {
    std::stringstream strstr;
    strstr << "triangles.shape[1] != 9" << std::endl;
    strstr << "triangles.shape[1]: " << tri_info.shape[1] << std::endl;
    throw std::runtime_error(strstr.str());
  }

  if (ray_info.ndim != 2) {
    std::stringstream strstr;
    strstr << "ray_dirs.ndim != 2" << std::endl;
    strstr << "ray_dirs.ndim: " << ray_info.ndim << std::endl;
    throw std::runtime_error(strstr.str());
  }

  if (ray_info.shape[1] != 3) {
    std::stringstream strstr;
    strstr << "ray_dirs.shape[1] != 9" << std::endl;
    strstr << "ray_dirs.shape[1]: " << ray_info.shape[1] << std::endl;
    throw std::runtime_error(strstr.str());
  }

  int num_rays = ray_info.shape[0];
  int num_triangles = tri_info.shape[0];

  double ox = origin.data()[0];
  double oy = origin.data()[1];
  double oz = origin.data()[2];
  float3 origin_vec = make_float3(ox, oy, oz);
  
  double* ray_dirs_gpu;
  double* triangles_gpu;
  double* hits_gpu;

  int N_COORDS_PER_TRI_VERTEX = 3;
  int N_VERTS_PER_TRI = 3;
  int N_COORDS_PER_RAY = 3;
  int N_COORDS_PER_HIT = 3;
  
  int size_bytes = num_rays*N_COORDS_PER_RAY*sizeof(double);
  cudaError_t error = cudaMalloc(&ray_dirs_gpu, size_bytes);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }
  double* ray_dirs_ptr = reinterpret_cast<double*>(ray_info.ptr);
  error = cudaMemcpy(ray_dirs_gpu, ray_dirs_ptr, size_bytes, cudaMemcpyHostToDevice);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }

  size_bytes = num_triangles*N_VERTS_PER_TRI*N_COORDS_PER_TRI_VERTEX*sizeof(double);
  error = cudaMalloc(&triangles_gpu, size_bytes);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }
  double* triangles_ptr = reinterpret_cast<double*>(tri_info.ptr);
  error = cudaMemcpy(triangles_gpu, triangles_ptr, size_bytes, cudaMemcpyHostToDevice);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }

  size_bytes = num_rays*N_COORDS_PER_HIT*sizeof(double);
  //std::cout << "Malloc " << size_bytes << " for hits_gpu" << std::endl;
  error = cudaMalloc(&hits_gpu, size_bytes);
  if (error != cudaSuccess) {
    std::cout << "Malloc failed!" << std::endl;
    throw std::runtime_error(cudaGetErrorString(error));
  }

  run_intersection_kernel(
    origin_vec,
    ray_dirs_gpu,
    triangles_gpu,
    hits_gpu,
    num_rays,
    num_triangles
  );

  //std::cout << "kernel ran successfully!" << std::endl;
  //std::cout << "Ray_dirs size: " << ray_info.size << std::endl;

  /* No pointer is passed, so NumPy will allocate the buffer */
  // auto inter_points_arr = pybind11::array_t<double>(ray_info.size);

  pybind11::buffer_info hits_info = hits.request();
  double *hits_cpu_ptr = static_cast<double *>(hits_info.ptr);

  error = cudaMemcpy(hits_cpu_ptr, hits_gpu, size_bytes, cudaMemcpyDeviceToHost);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }

  // for (int i=0; i<ray_info.size; i++)
  // {
  //   std::cout << "Hit " << i << ": " << hits_cpu_ptr[i]  << std::endl;
  // }

  // Eigen::VectorXb inter_exists_arr = Eigen::VectorXb(num_rays);
  // inter_exists_arr.setConstant(false); 

  // Eigen::MatrixXd inter_points_arr = Eigen::MatrixXd(num_rays,3);
  // inter_points_arr.setZero();

  error = cudaFree(ray_dirs_gpu);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }

  error = cudaFree(triangles_gpu);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }

  error = cudaFree(hits_gpu);
  if (error != cudaSuccess) {
    throw std::runtime_error(cudaGetErrorString(error));
  }

  // return std::make_tuple(inter_exists_arr, inter_points_arr);
  //return inter_points_arr;
};



PYBIND11_MODULE(tbv_raytracing, m)
{
  m.def("intersect_rays_with_tri_mesh", &intersect_rays_with_tri_mesh);
}

References:

  1. Tomas Moller and Ben Trombore. Fast Minimum Storage Ray/Triangle Intersection. Journal of Graphics Tools, 1997. PDF.

  2. Scratchapixel. Ray Tracing: Rendering a Triangle. URL.