Optimizing BVH Traversal in Ray-Tracers

When implementing ray-tracing engines, one of the most critical bottlenecks is the Bounding Volume Hierarchy (BVH) traversal. In this deep dive, we’ll explore how to eliminate performance cliffs through strategic memory layout and SIMD optimization.

The Problem with Cache Misses

Traditional BVH implementations often suffer from poor cache locality. Consider a typical tree traversal:

struct BVHNode {
    AABB bounds;
    int left, right;
    int* triangles;
};

// This causes cache misses at nearly every level
bool traverse(Ray& ray) {
    BVHNode& node = nodes[current];
    if (node.bounds.intersect(ray)) {
        // Pointers lead everywhere in memory
    }
}

Cache-Friendly Layout

The solution involves restructuring our data for spatial coherence:

struct CompactNode {
    float bounds[6];  // AABB as floats
    union {
        int left;
        int triangleOffset;
    };
    int metadata;     // flags and counts
};

SIMD Acceleration

Using SIMD instructions to test multiple rays or node bounds simultaneously:

void simdTraverse(SimdRay rays[4]) {
    __m128 boundsX = _mm_set_ps(...);
    __m128 intersect = rayBoxTest_SSE(rays, boundsX);
    // Process 4 rays in parallel
}

Benchmarks

Our optimized approach shows 3-4x improvement in traversal speed:

Original: 145 ns/ray
Optimized: 38 ns/ray

The key is maintaining cache-line alignment and minimizing pointer chasing.