Warp Efficiency in CUDA Kernels

Writing efficient CUDA kernels requires understanding how warps execute code. Warp divergence can severely degrade performance, but once the mechanics are clear it becomes much easier to avoid.

Warp Basics

A warp is a group of 32 threads that the GPU schedules and executes together. When threads within a warp take different sides of a conditional, the hardware serializes execution, running each branch with the non-participating lanes masked off:

__global__ void kernelWithDivergence(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    if (idx % 2 == 0) {
        // Even lanes run the expensive path while odd lanes sit masked off
        data[idx] = expensiveCalculation(data[idx]);
    } else {
        // Then odd lanes run the cheap path while even lanes sit masked off
        data[idx] = simpleCalculation(data[idx]);
    }
}
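
The two helper functions are not defined in the post; a minimal stand-in might look like the following (the bodies are placeholders assumed purely for illustration, and in a real file they would need to be defined before the kernels that call them):

__device__ float expensiveCalculation(float x) {
    // Stand-in for a heavy per-element computation
    float acc = x;
    for (int i = 0; i < 64; ++i) {
        acc = sinf(acc) * cosf(acc) + x;
    }
    return acc;
}

__device__ float simpleCalculation(float x) {
    // Stand-in for a cheap per-element computation
    return x * 0.5f + 1.0f;
}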

Avoiding Divergence

Because the warp executes the expensive branch with odd lanes masked off and then the cheap branch with even lanes masked off, the branch costs roughly the sum of both paths. Restructure the data so that all threads in a warp follow the same path:

__global__ void kernelOptimized(float* dataEven, float* dataOdd) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    // No branch on the thread index: every thread in the warp follows
    // the same path through both calls, so nothing is serialized
    dataEven[idx] = expensiveCalculation(dataEven[idx]);
    dataOdd[idx]  = simpleCalculation(dataOdd[idx]);
}
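
If the data cannot be split into separate arrays, another common option (sketched here as an assumption, not taken from the original) is to align the branch with warp boundaries so each warp takes a single path. Note that this changes which elements receive the expensive path, so it only applies when the work assignment itself can be rearranged:

__global__ void kernelWarpAligned(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Assumes blockDim.x is a multiple of 32, so idx / 32 is constant
    // across each warp
    int warpId = idx / 32;

    // Every lane in a warp sees the same warpId, so the whole warp takes
    // one side of the branch and no serialization occurs
    if (warpId % 2 == 0) {
        data[idx] = expensiveCalculation(data[idx]);
    } else {
        data[idx] = simpleCalculation(data[idx]);
    }
}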

Memory Access Patterns

Coalesced memory access is just as critical as divergence; a short sketch contrasting the two patterns follows the list.

  • Good: data[threadIdx.x] - consecutive threads touch consecutive addresses, so the warp's loads coalesce into a few transactions
  • Bad: data[threadIdx.x * stride] - strided, uncoalesced; one warp can touch up to 32 separate cache lines
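
A minimal sketch of the two patterns (the kernel names and the stride parameter are illustrative, not from the original):

__global__ void copyCoalesced(const float* in, float* out) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Adjacent threads read adjacent floats: the warp's 32 loads
    // coalesce into one or two memory transactions
    out[idx] = in[idx];
}

__global__ void copyStrided(const float* in, float* out, int stride) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Adjacent threads read addresses `stride` elements apart: with a
    // large stride each lane hits its own cache line
    // (assumes `in` is sized for idx * stride)
    out[idx] = in[idx * stride];
}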

Shared Memory Best Practices

// Inside a kernel: stage global data in shared memory
__shared__ float sharedData[256];

// Consecutive threads write consecutive addresses, which map to
// consecutive banks - no bank conflicts
sharedData[threadIdx.x] = global_data[idx];
__syncthreads();

// For 2D tiles, pad the row length from 32 to 33 so that column-wise
// accesses land in different banks instead of all hitting the same one
__shared__ float padded[16][33];
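
For a fuller picture of how a padded tile is typically used, here is a tiled transpose sketch (a common pattern, sketched rather than taken from the original; launch with a 32x32 block and a grid of ceil(width/32) x ceil(height/32)):

#define TILE_DIM 32

__global__ void transposeTiled(const float* in, float* out, int width, int height) {
    // The +1 pad shifts each row by one bank, so the column-wise reads
    // below hit 32 different banks instead of the same one
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    }
    __syncthreads();

    // Swap the block coordinates and read the tile transposed
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}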

Measuring Efficiency

Use nvprof (or, on newer architectures, NVIDIA's Nsight Compute):

nvprof --metrics gst_efficiency,achieved_occupancy ./kernel
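
On recent GPU architectures, nvprof's metric collection has been superseded by Nsight Compute, which exposes a comparable occupancy counter. The metric name below is typical but version-dependent, so confirm it with ncu --query-metrics:

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./kernel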

Target metrics:

  • Global store efficiency (gst_efficiency): >90%
  • Achieved occupancy (achieved_occupancy): >80% (see the occupancy sketch after this list)
  • Warp execution efficiency (warp_execution_efficiency): >95%
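
Theoretical occupancy can also be checked from host code with the CUDA occupancy API; a minimal sketch (the placeholder kernel and block size are assumptions for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernels above
__global__ void scaleKernel(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int maxActiveBlocks = 0;
    // Ask the runtime how many blocks of this kernel can be resident per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
                                                  scaleKernel, blockSize, 0);

    // Resident threads per SM divided by the hardware maximum
    float occupancy = (maxActiveBlocks * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}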

With careful attention to divergence and memory patterns, you can achieve 2-3x speedups over naive implementations.