JLR GPU Kernels
Worked on GPU kernel optimization for deep learning workloads in CUDA and Triton, placing 2nd in a Jaguar Land Rover hackathon. Fusing GELU and ReLU activation variants into their preceding kernels delivered an 8% reduction in execution time and a 12% throughput improvement.
- 8% reduction in execution time
- 12% throughput improvement for DL workloads
- 2nd place in the JLR GPU Kernel Hackathon
Kernel Fusion Strategy
- Fused GELU + linear projection into a single CUDA kernel, eliminating two round-trips to HBM
- Applied Triton autotuning for tile sizes across A100 and H100 targets
- Merged ReLU variants (Leaky, SiLU, GELU-approx) with preceding matmul ops
- Measured peak FLOP/s utilization before and after fusion with Nsight Compute

The sketches below illustrate each of these steps in turn.
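The fusion itself could look roughly like the following Triton sketch (illustrative, not the hackathon code): the GELU epilogue runs in registers between the `tl.dot` accumulation and the single store, so the pre-activation matrix never makes the extra HBM round-trips. Kernel name, tile sizes, and the fp32 assumption are all placeholders.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def fused_linear_gelu_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # One program instance computes one BLOCK_M x BLOCK_N tile of C = gelu(A @ B).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Fused epilogue: tanh-approximate GELU rewritten via sigmoid, using
    # tanh(z) = 2*sigmoid(2z) - 1, applied in registers before the only HBM store.
    out = acc * tl.sigmoid(1.5957691216 * (acc + 0.044715 * acc * acc * acc))

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, out, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))


def fused_linear_gelu(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Assumes fp32 CUDA tensors; strides are passed so inputs need not be contiguous.
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    fused_linear_gelu_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
        BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
    )
    return c
```

The unfused baseline for comparison would be `torch.nn.functional.gelu(a @ b, approximate="tanh")`, which materializes the pre-activation matmul result in HBM before reading it back.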
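Tile-size autotuning can be layered onto the same kernel: `triton.autotune` benchmarks each config the first time it sees a new `(M, N, K)` shape, so identical source code ends up with different tiles on A100 and H100. The configs below are illustrative guesses, not the tuned values from the hackathon.

```python
# Wrap the kernel above with an autotuner; the grid then reads the chosen
# tile sizes from the selected config via META instead of hard-coding them.
tuned_kernel = triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=8),
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}, num_warps=8),
    ],
    key=["M", "N", "K"],  # re-benchmark whenever the problem shape changes
)(fused_linear_gelu_kernel)


def fused_linear_gelu_tuned(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = lambda META: (triton.cdiv(M, META["BLOCK_M"]), triton.cdiv(N, META["BLOCK_N"]))
    # Note: BLOCK_* are supplied by the autotuner, so they are not passed here.
    tuned_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c
```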
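For the merged ReLU variants, one plausible pattern (again a sketch, not the original code) is a compile-time activation switch: because `ACTIVATION` is a `tl.constexpr`, Triton resolves the branch during compilation and specializes each variant into its own branch-free fused kernel.

```python
import triton
import triton.language as tl


@triton.jit
def apply_activation(x, ACTIVATION: tl.constexpr):
    # The string comparisons are resolved at compile time; dead branches
    # are pruned, so each variant compiles to straight-line code.
    if ACTIVATION == "relu":
        out = tl.maximum(x, 0.0)
    elif ACTIVATION == "leaky_relu":
        out = tl.where(x > 0, x, 0.01 * x)
    elif ACTIVATION == "silu":
        out = x * tl.sigmoid(x)
    else:
        # tanh-approximate GELU, as in the fused epilogue above
        out = x * tl.sigmoid(1.5957691216 * (x + 0.044715 * x * x * x))
    return out
```

The matmul kernel would then take an `ACTIVATION: tl.constexpr` argument and call `apply_activation(acc, ACTIVATION)` in place of the hard-coded GELU epilogue.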
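The before/after FLOP/s measurement could be scripted around the Nsight Compute CLI roughly as below. `bench_fused_gelu.py` is a hypothetical benchmark script; the metrics are standard SASS-level counters, with achieved FLOPs derived as 2*FFMA + FADD + FMUL over the kernel duration.

```python
# A sketch of driving ncu from Python; assumes ncu is on PATH and that
# bench_fused_gelu.py (a placeholder name) launches the kernels of interest.
import subprocess

METRICS = ",".join([
    "sm__sass_thread_inst_executed_op_ffma_pred_on.sum",  # FFMA count
    "sm__sass_thread_inst_executed_op_fadd_pred_on.sum",  # FADD count
    "sm__sass_thread_inst_executed_op_fmul_pred_on.sum",  # FMUL count
    "dram__bytes.sum",           # HBM traffic, to confirm the saved round-trips
    "gpu__time_duration.sum",    # kernel duration, denominator for FLOP/s
])

subprocess.run(
    ["ncu", "--csv", "--target-processes", "all",
     "--metrics", METRICS, "python", "bench_fused_gelu.py"],
    check=True,
)
```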
Context
Worked on GPU kernel optimization for deep learning inference workloads as part of a Jaguar Land Rover hackathon. The work focused on reducing HBM memory traffic and improving throughput via strategic kernel fusion.
Skills: Machine Learning, CUDA, Triton.
Languages
CUDA C++, Python, Triton
Tools
Nsight Compute, PyTorch, cuBLAS
Hardware
NVIDIA A100, H100 (via cloud)