AI Accelerator Software Engineer – Silicon Software & Low-Level AI

Skills
  • C ꞏ 6y
  • C++ ꞏ 6y
  • Assembly
  • Accelerator ꞏ 6y
  • Firmware ꞏ 6y
  • Embedded ꞏ 6y
  • Systems ꞏ 6y
  • Hardware-aware algorithm optimization
  • Memory hierarchies
  • Parallel execution
  • DMA
  • Systems-oriented reasoning
  • Caches
  • Bit-level reasoning
  • Bandwidth optimization
  • Performance-debug
  • Profiling
  • NPU programming
  • Low-level programming
  • HIP
  • Tracing
  • GPU programming
  • FPGA programming
  • Firmware development
  • DSP programming
  • DSP algorithms
  • Driver development
  • Deep learning infrastructure
  • Custom accelerator programming
  • CUDA
  • Compute kernel development
  • AI inference optimization

Most GPU engineers work within the limits of what NVIDIA decided.

Here, you decide the limits.

GSI Technology (NASDAQ: GSIT) is developing Gemini2 — an Associative Processing Unit built for ultra-low latency, high-parallelism AI execution. We're not building on top of someone else's stack. We're building the stack — and we need engineers who've been waiting for exactly this kind of problem.

🔬 The gap you'll close

Between modern AI models and novel compute-in-memory hardware lies a space that PyTorch can't see and CUDA can't reach — memory access patterns, DMA flows, instruction scheduling, and execution strategies that simply don't have a reference implementation yet.

That's your domain.

⚙️ What you'll build

Highly optimized compute kernels for Transformer inference, LLM/VLM execution, FFTs, OpenCV pipelines, and Edge AI workloads

Memory access patterns, DMA utilization, and instruction scheduling — tuned for silicon that didn't exist two years ago

Performance analysis pipelines using profilers, traces, and hardware analyzers, plus the fixes for whatever they uncover

Benchmarking infrastructure, internal tooling, and testing frameworks

Work directly with Architecture, Compiler, and AI teams — your kernel-level decisions shape how the next version of the chip gets designed

✅ What we need

B.Sc./M.Sc. in CS, EE, or equivalent

6+ years in low-level C/C++: embedded, firmware, accelerator, systems, or performance-critical software

Deep understanding of:

  • Memory hierarchies, caches, DMA, and bandwidth optimization

  • Parallel execution and performance-critical code

  • Hardware-aware algorithm optimization

  • Bit-level and systems-oriented reasoning

⭐ Strong bonus if you bring

GPU / NPU / DSP / FPGA or custom accelerator programming

Assembly or low-level programming experience

Compute kernel, firmware, or driver development

AI inference optimization or deep learning infrastructure

Profiling, tracing, and performance-debug experience

🎯 You're likely a strong fit if you've ever...

Written CUDA or HIP kernels — and wanted to go deeper than the driver allows

Spent days hunting a 3% latency regression in embedded firmware and felt satisfied when you found it

Looked at a DMA controller spec and felt curious, not scared

Worked on DSP algorithms and wondered what it'd feel like to do it for AI workloads

Had opinions about both sides of a hardware/software interface

📍 Tel Aviv, Ramat Hahayal | Full-Time | Hybrid

💰 Competitive compensation + (NASDAQ: GSIT)

Not sure if your background is the right fit? Reach out. We'd rather have the conversation.

GSI Technology