DevJobs

AI Kernel Writer

Overview
Skills
  • C++
  • Valgrind
  • CUDA
  • LLVM MLIR
  • Metal
  • OpenCL
  • OpenMP
  • POSIX Threads
  • SYCL
  • Triton
  • BLAS
  • BLIS
  • CI testing frameworks
  • GNU Debugger
  • Perfetto
  • Torch
  • Tracy
  • VTune

Responsibilities:

  • Design and implement high-performance compute kernels for AI primitives such as GEMM, attention, normalization, and convolution.
  • Optimize for throughput, latency, and memory hierarchy across heterogeneous compute units (SIMD, matrix engines, DMA).
  • Collaborate with compiler and runtime teams to integrate kernels into Triton, PyTorch, or SYCL pipelines.
  • Profile and tune kernels using tools like Perfetto, VTune, Tracy, or custom simulators.
  • Prototype and evaluate reduced-precision formats (FP16, BF16, FP8 variants such as e5m2, etc.) and stochastic rounding.
  • Contribute to micro-architecture feedback loops, helping co-design ISA and memory features with the hardware team.
  • Write clear, well-structured, and reusable code (C++/CUDA/Triton/LLVM MLIR).

Requirements:

  • Bachelor's or Master's in Computer Science, Computer Engineering, or a related field from a recognized university.
  • Strong background in parallel programming (CUDA, Triton, SYCL, OpenCL, Metal, POSIX Threads, or OpenMP).
  • Experience with optimization of irregular algorithms, such as graph computations or sparse numerical linear algebra, combining high-level data structure design with low-level SIMD and synchronization optimizations.
  • Deep understanding of memory layout, vectorization, thread/block scheduling, and cache behavior.
  • Proficiency in C++11 or later, with strong knowledge of standard algorithms, data structures, and generic programming paradigms.
  • Experience with code generation for high-performance computations and knowledge of libraries and frameworks such as BLAS, BLIS, and Torch.
  • Skilled in performance analysis and parallel debugging using tools such as Valgrind, GNU Debugger, or CI testing frameworks.
  • Hands-on experience profiling and optimizing compute or AI workloads (e.g., GEMM, softmax, attention).
  • Solid grasp of numerical stability, precision formats, and mixed precision arithmetic.
  • Collaborative work style with the ability to operate effectively in multicultural, cross-disciplinary environments.
Majestic Labs