Responsibilities:
Design and implement high-performance compute kernels for AI primitives such as GEMM, attention, normalization, and convolution.
Optimize for throughput, latency, and memory hierarchy across heterogeneous compute units (SIMD, matrix engines, DMA).
Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines.
Profile and tune kernels using tools like Perfetto, VTune, Tracy, or custom simulators.
Prototype and evaluate reduced-precision formats (FP16, BF16, FP8 variants such as E5M2) and stochastic rounding techniques.
Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team.
Write clear, well-structured, and reusable code (C++, CUDA, Triton, LLVM/MLIR).
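To make the stochastic-rounding responsibility above concrete, here is a minimal sketch of what such a prototype might look like. It is an illustrative NumPy implementation (the function name and approach are our own, not part of any library): bfloat16 keeps the top 16 bits of an IEEE-754 float32, and adding a uniform random offset to the low bits before truncating rounds each value up with probability proportional to its position between the two representable neighbours, making the rounding unbiased on average.

```python
import numpy as np

def stochastic_round_to_bf16(x, rng):
    """Stochastically round float32 values to bfloat16 precision.

    bfloat16 is the top 16 bits of a float32; plain truncation
    always rounds toward zero magnitude in the dropped bits.
    Adding uniform noise in [0, 2^16) before masking makes the
    expected rounded value equal to the input (a sketch, not
    production code: no handling of inf/NaN or overflow cases).
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    rounded = (bits + noise) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)

rng = np.random.default_rng(0)
x = np.float32(1.0 + 2**-10)  # falls between two bf16 neighbours
samples = stochastic_round_to_bf16(np.full(10000, x, np.float32), rng)
# Truncation would always return 1.0 here; the stochastic mean
# should land much closer to x than that biased result.
print(abs(samples.mean() - x) < 2**-10)
```

Each sample is either of the two adjacent bf16 values, but the average over many samples recovers the original, which is why stochastic rounding matters for low-precision training accumulation.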
Requirements:
Bachelor's or Master's in Computer Science, Computer Engineering, or a related field from a recognized university.
Strong background in parallel programming (CUDA, Triton, SYCL, OpenCL, Metal, POSIX Threads, or OpenMP).
Experience with optimization of irregular algorithms, such as graph computations or sparse numerical linear algebra, combining high-level data structure design with low-level SIMD and synchronization optimizations.
Deep understanding of memory layout, vectorization, thread/block scheduling, and cache behavior.
Proficiency in C++11 or higher, with strong knowledge of standard algorithms, data structures, and generic programming paradigms.
Experience with code generation for high-performance computation, and familiarity with libraries such as BLAS, BLIS, or Torch.
Skilled in performance analysis and parallel debugging using tools such as Valgrind, GDB, or CI testing frameworks.
Hands-on experience profiling and optimizing compute or AI workloads (e.g., GEMM, softmax, attention).
Solid grasp of numerical stability, precision formats, and mixed precision arithmetic.
Collaborative work style with the ability to operate effectively in multicultural, cross-disciplinary environments.
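As an illustration of the numerical-stability requirement above, consider softmax, one of the workloads named in this posting. The sketch below (our own minimal example, not a reference implementation) contrasts a naive softmax, which overflows for large logits, with the standard max-subtraction form that every production attention kernel uses:

```python
import numpy as np

def softmax_naive(x):
    # exp() of large logits overflows float32 to inf, and
    # inf / inf then yields NaN.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting the max shifts the largest exponent to 0.
    # Softmax is invariant to additive shifts, so the result
    # is unchanged but every exp() argument is <= 0 (no overflow).
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)
print(softmax_naive(logits))   # NaNs: exp(1000) overflows float32
print(softmax_stable(logits))  # finite probabilities summing to 1
```

The same shift-invariance trick underlies online-softmax formulations used in fused attention kernels, where the running max is updated tile by tile.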