We're expanding our Cloud Infrastructure and Platform Engineering team and looking for an experienced Senior DevOps Engineer to shape and scale our GenAI platform: 40+ Kubernetes clusters across GCP and AWS, with a strong emphasis on GPU workloads.
Our stack centers on multi-tenant GPU scheduling, GitOps via ArgoCD, and the operational complexity of running production-grade AI training and inference at scale. You'll spend your time on the infrastructure problems that come with running AI at this scale: GPU scheduling, training and inference in production, and real cost and reliability tradeoffs.
This is a hands-on engineering role. You'll shape the architecture and operation of the systems in your scope, not just execute against a backlog.
Role and Responsibilities:
- Cloud & Kubernetes: Design, operate, and scale multi-cluster Kubernetes environments across GCP and AWS.
- GPU & AI workloads: Own multi-tenant GPU scheduling for training and inference at scale, including capacity planning, utilization, and cost.
- Developer experience & enablement: Lead the developer platform that powers R&D, building self-service tools and automation that compound across the org.
- Reliability & cost: Optimize cost, performance, and reliability through monitoring, capacity planning, and scaling strategies.
- Security & governance: Set the bar for RBAC, IAM, cloud security, and compliance across our cloud footprint.
- Infrastructure as Code: Drive GitOps and IaC adoption (Terraform, Helm, Crossplane, ArgoCD).
- Cross-team collaboration: Partner with engineering teams to align infrastructure with product and reliability needs.
- Technology assessment: Evaluate and adopt technologies that improve scalability and efficiency.
Requirements:
- 7+ years in DevOps or SRE.
- Deep Kubernetes expertise in large-scale, multi-cluster, enterprise-grade environments on GCP and/or AWS.
- Hands-on experience operating GPU workloads in production at scale, including multi-tenant scheduling, capacity, and utilization.
- Strong background in Infrastructure as Code (Terraform, Helm) and GitOps principles and tooling (ArgoCD, Crossplane, FluxCD).
- Hands-on experience with observability and monitoring (Prometheus, Grafana, Datadog, OpenTelemetry).
Nice-to-have:
- Experience with self-hosted on-prem deployments and managed private VPC deployments (Bring Your Own Cloud).
- Experience designing and managing CRDs and custom controllers.
- DevSecOps experience with security automation and compliance frameworks.
- Experience operating GenAI or large-scale SaaS platforms.
About Us:
AI21 Labs is pioneering the development of Foundation Models and AI Systems for enterprises, accelerating the adoption of Generative AI in production.
AI21 Labs was established in 2017 by AI visionaries Prof. Amnon Shashua, Prof. Yoav Shoham, and Ori Goshen, with a mission to equip businesses with cutting-edge LLMs and AI capabilities. We are backed by leading investors including Pitango, Google, Nvidia, Intel Capital, and Comcast Ventures.
Join us on this exciting journey and advance your career with AI21 Labs!