Software Engineer (Large Scale Training)

Skills
  • C++
  • Python
  • Kubernetes
  • CUDA
  • GPU
  • JAX
  • Metal
  • OpenCL
  • Pallas
  • TPU
  • Triton
Who We Are

Lightricks is an AI-first company creating next-generation content creation technology for businesses, enterprises, and studios with a mission to bridge the gap between imagination and creation. At our core is LTX-2, an open-source generative video model, built to deliver expressive, high-fidelity video at unmatched speed. It powers both our own products and a growing ecosystem of partners through API integration.

The company is also known globally for pioneering consumer creativity through products like Facetune, one of the world's most recognized creative brands, which helped introduce AI-powered visual expression to hundreds of millions of users worldwide. We combine deep research, user-first design, and end-to-end execution from concept to final render to bring the future of expression to all.

About The Role

This is a software engineering role on an ML team. You'll own the systems that make large-scale model training fast, reliable, and pleasant to work with: the distributed training framework, the data pipelines feeding it, the performance characteristics of every step on the critical path, and the day-to-day developer experience for the researchers who depend on it.

You don't need to come in as an ML expert. You do need to be a strong engineer who gets excited about hard systems problems: squeezing throughput out of accelerator clusters, hunting down stragglers across hundreds of machines, designing abstractions that hold up as the codebase grows, and making the unglamorous parts of training infrastructure work well.

If you've ever looked at a large-scale system and thought "there's no reason this should be this slow / inefficient / hard to maintain / complex," this role is built for you.

Key Responsibilities

  • Build and maintain the distributed training framework: orchestration, checkpointing, fault tolerance, observability, and the ergonomics researchers interact with daily.
  • Profile end-to-end training runs and eliminate bottlenecks wherever they live: compute, memory, interconnect, storage, or the data pipeline.
  • Collaborate with researchers to translate model ideas into training code that runs efficiently, and flag when an architectural choice will be expensive before it ships.
  • Own a shared codebase the team relies on: correctness, readability, testing, and long-term maintainability matter as much as the benchmark numbers.
  • Work close to the metal where it pays off: write or integrate custom GPU kernels, tune collective communication, and exploit hardware features that off-the-shelf frameworks leave on the table.

Your Skills And Experience

  • Strong software engineering fundamentals. You write clean, tested, maintainable Python, and you're comfortable reading and writing modern C++.
  • Real experience with performance work: profiling, optimization, and reasoning about systems where latency, throughput, and resource contention actually matter.
  • Comfort with distributed systems: you've debugged things that only break at scale and have intuitions for where they tend to go wrong.
  • A bias toward understanding systems end-to-end rather than treating any layer as a black box.
  • Familiarity with Kubernetes or similar environments for running and scaling large workloads.
  • ML training experience is a bonus. If you have it, great, but we'd rather hire a strong systems engineer who's curious about ML than an ML engineer who's lukewarm about infrastructure.

Nice to have

  • Working knowledge of at least one accelerator architecture (GPU, TPU, or similar), or a clear track record of going deep on hardware when the problem calls for it.
  • Experience with JAX/Pallas, Triton, CUDA, OpenCL, Metal, or similar accelerator programming.
  • Prior exposure to ML training pipelines, even informally. Pet projects count.

Why Join Us

We’re here to push the boundaries of what’s possible with AI and video, not for the buzz, but for the craft, the challenge, and the chance to make something genuinely new.

We believe in an environment where people are encouraged to think, create, and explore. Real impact happens when people are empowered to experiment, evolve, and elevate together. At Lightricks, every breakthrough starts with great people and a collaborative mindset. If you're looking for a place that combines deep tech, creative energy, and zero buzzword culture, you might be in the right place.

We've got you covered:

  • We run daily door-to-door shuttles and offer Car-to-go subscriptions for several locations in central Israel, plus free parking and train-station pickups.
  • We’re proud to have 2 chef-led restaurants on site by the legendary Machneyuda Group (yes, that Machneyuda!), plus a bakery nestled in the heart of our office, filled daily with the scent of fresh pastries.
  • We empower employees with cutting-edge tools and learning opportunities to grow and succeed: workshops, access to and training on platforms, subscriptions, and clear guidelines for responsible AI use.