Warp Efficiency in CUDA Kernels

Writing efficient CUDA kernels requires understanding how warps execute code. Warp divergence can severely degrade performance, but once the mechanics are clear it becomes much easier to avoid.

Warp Basics

A warp is a group of 32 threads that the GPU schedules and executes together. When threads within a warp take different sides of a conditional, the hardware serializes execution, running each branch with the non-participating lanes masked off:

__global__ void kernelWithDivergence(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    if (idx % 2 == 0) {
        // Even lanes run the expensive path while odd lanes sit masked off
        data[idx] = expensiveCalculation(data[idx]);
    } else {
        // Then odd lanes run the cheap path while even lanes sit masked off
        data[idx] = simpleCalculation(data[idx]);
    }
}
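
The two helper functions are not defined in the post; a minimal stand-in might look like the following (the bodies are placeholders assumed purely for illustration, and in a real file they would need to be defined before the kernels that call them):

__device__ float expensiveCalculation(float x) {
    // Stand-in for a heavy per-element computation
    float acc = x;
    for (int i = 0; i < 64; ++i) {
        acc = sinf(acc) * cosf(acc) + x;
    }
    return acc;
}

__device__ float simpleCalculation(float x) {
    // Stand-in for a cheap per-element computation
    return x * 0.5f + 1.0f;
}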

Avoiding Divergence

Because the warp executes the expensive branch with odd lanes masked off and then the cheap branch with even lanes masked off, the branch costs roughly the sum of both paths. Restructure the data so that all threads in a warp follow the same path:

__global__ void kernelOptimized(float* dataEven, float* dataOdd) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    
    // No branch on the thread index: every thread in the warp follows
    // the same path through both calls, so nothing is serialized
    dataEven[idx] = expensiveCalculation(dataEven[idx]);
    dataOdd[idx]  = simpleCalculation(dataOdd[idx]);
}
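
If the data cannot be split into separate arrays, another common option (sketched here as an assumption, not taken from the original) is to align the branch with warp boundaries so each warp takes a single path. Note that this changes which elements receive the expensive path, so it only applies when the work assignment itself can be rearranged:

__global__ void kernelWarpAligned(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Assumes blockDim.x is a multiple of 32, so idx / 32 is constant
    // across each warp
    int warpId = idx / 32;

    // Every lane in a warp sees the same warpId, so the whole warp takes
    // one side of the branch and no serialization occurs
    if (warpId % 2 == 0) {
        data[idx] = expensiveCalculation(data[idx]);
    } else {
        data[idx] = simpleCalculation(data[idx]);
    }
}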

Memory Access Patterns

Coalesced memory access is just as critical as divergence; a short sketch contrasting the two patterns follows the list.

  • Good: data[threadIdx.x] - consecutive threads touch consecutive addresses, so the warp's loads coalesce into a few transactions
  • Bad: data[threadIdx.x * stride] - strided, uncoalesced; one warp can touch up to 32 separate cache lines
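
A minimal sketch of the two patterns (the kernel names and the stride parameter are illustrative, not from the original):

__global__ void copyCoalesced(const float* in, float* out) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Adjacent threads read adjacent floats: the warp's 32 loads
    // coalesce into one or two memory transactions
    out[idx] = in[idx];
}

__global__ void copyStrided(const float* in, float* out, int stride) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // Adjacent threads read addresses `stride` elements apart: with a
    // large stride each lane hits its own cache line
    // (assumes `in` is sized for idx * stride)
    out[idx] = in[idx * stride];
}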

Shared Memory Best Practices

// Inside a kernel: stage global data in shared memory
__shared__ float sharedData[256];

// Consecutive threads write consecutive addresses, which map to
// consecutive banks - no bank conflicts
sharedData[threadIdx.x] = global_data[idx];
__syncthreads();

// For 2D tiles, pad the row length from 32 to 33 so that column-wise
// accesses land in different banks instead of all hitting the same one
__shared__ float padded[16][33];
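
For a fuller picture of how a padded tile is typically used, here is a tiled transpose sketch (a common pattern, sketched rather than taken from the original; launch with a 32x32 block and a grid of ceil(width/32) x ceil(height/32)):

#define TILE_DIM 32

__global__ void transposeTiled(const float* in, float* out, int width, int height) {
    // The +1 pad shifts each row by one bank, so the column-wise reads
    // below hit 32 different banks instead of the same one
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    }
    __syncthreads();

    // Swap the block coordinates and read the tile transposed
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width) {
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}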

Measuring Efficiency

Use nvprof (or, on newer architectures, NVIDIA's Nsight Compute):

nvprof --metrics gst_efficiency,achieved_occupancy ./kernel
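
On recent GPU architectures, nvprof's metric collection has been superseded by Nsight Compute, which exposes a comparable occupancy counter. The metric name below is typical but version-dependent, so confirm it with ncu --query-metrics:

ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./kernel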

Target metrics:

  • Global store efficiency (gst_efficiency): >90%
  • Achieved occupancy (achieved_occupancy): >80% (see the occupancy sketch after this list)
  • Warp execution efficiency (warp_execution_efficiency): >95%
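
Theoretical occupancy can also be checked from host code with the CUDA occupancy API; a minimal sketch (the placeholder kernel and block size are assumptions for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernels above
__global__ void scaleKernel(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int maxActiveBlocks = 0;
    // Ask the runtime how many blocks of this kernel can be resident per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
                                                  scaleKernel, blockSize, 0);

    // Resident threads per SM divided by the hardware maximum
    float occupancy = (maxActiveBlocks * blockSize) /
                      (float)prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}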

With careful attention to divergence and memory patterns, you can achieve 2-3x speedups over naive implementations.