CUDA, Triton, GPU, Deep Learning, Systems

JLR GPU Kernels

Worked on GPU kernel optimization for deep learning workloads in CUDA and Triton, placing 2nd in a Jaguar Land Rover hackathon. Fusing GELU and ReLU activation variants into their surrounding kernels cut execution time by 8% and improved throughput by 12%.

  • 8% reduction in execution time
  • 12% throughput improvement for DL workloads
  • 2nd place in the JLR GPU Kernel Hackathon


Kernel Fusion Strategy

  • Fused GELU + linear projection into a single CUDA kernel, eliminating two round-trips to HBM (a minimal sketch of the pattern follows this list)
  • Applied Triton autotuning to select tile sizes for A100 and H100 targets
  • Merged activation variants (Leaky ReLU, SiLU, approximate GELU) into the preceding matmul kernels
  • Measured peak FLOP/s utilization before and after fusion with Nsight Compute
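
The project's original kernels aren't reproduced here, but the fusion pattern is easy to sketch. The minimal Triton kernel below (all names, shapes, and tile configs are illustrative assumptions, not the hackathon code) fuses a bias add with tanh-approximate GELU so the intermediate stays in registers instead of round-tripping through HBM, and uses @triton.autotune to pick a block size, as on the A100/H100 runs.

    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
            triton.Config({"BLOCK_SIZE": 2048}, num_warps=8),
        ],
        key=["n_elements"],
    )
    @triton.jit
    def fused_bias_gelu_kernel(x_ptr, bias_ptr, out_ptr,
                               n_elements, bias_len,
                               BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        # broadcast the bias across rows; assumes x is contiguous with
        # last dimension equal to bias_len
        b = tl.load(bias_ptr + offsets % bias_len, mask=mask)
        y = x + b
        # tanh-approximate GELU via sigmoid (tanh(a) = 2*sigmoid(2a) - 1),
        # computed in registers: the bias-add intermediate never touches HBM
        inner = 1.5957691216057308 * (y + 0.044715 * y * y * y)
        tl.store(out_ptr + offsets, y * tl.sigmoid(inner), mask=mask)

    def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n_elements = x.numel()
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        fused_bias_gelu_kernel[grid](x, bias, out, n_elements, bias.numel())
        return out

A matmul-epilogue fusion follows the same idea: the activation is applied to the accumulator tile before the single store, which is what removes the extra kernel launch and its HBM round-trip.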

Context

Worked on GPU kernel optimization for deep learning inference workloads as part of a Jaguar Land Rover hackathon. The work focused on reducing HBM memory traffic and improving throughput through strategic kernel fusion.
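
One way to substantiate figures like the 8% and 12% above is plain CUDA-event timing of the fused path against the eager two-kernel baseline. The harness below is a sketch under assumptions (tensor shapes are arbitrary; fused_bias_gelu refers to the illustrative wrapper above, not the hackathon code); Nsight Compute would then attribute the gain to reduced DRAM traffic.

    import torch

    def bench_ms(fn, iters=100):
        # CUDA-event timing: warm up, then average milliseconds per launch
        for _ in range(10):
            fn()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    x = torch.randn(8192, 4096, device="cuda")
    b = torch.randn(4096, device="cuda")

    # unfused baseline: bias add and GELU run as two kernels, with the
    # intermediate (x + b) written to and re-read from HBM
    baseline = bench_ms(lambda: torch.nn.functional.gelu(x + b, approximate="tanh"))
    # fused version: one kernel, one read and one write per element
    fused = bench_ms(lambda: fused_bias_gelu(x, b))
    print(f"baseline {baseline:.3f} ms, fused {fused:.3f} ms")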

Skills: Machine Learning, CUDA, Triton.

Languages

CUDA C++, Python, Triton

Tools

Nsight Compute, PyTorch, cuBLAS

Hardware

NVIDIA A100, H100 (via cloud)