Cloudinary empowers companies to deliver exceptional digital experiences by managing the entire media lifecycle at scale. Within Cloudinary’s R&D, the Research Group leads the development of cutting-edge algorithms for media understanding, generation, and optimization. We are seeking an experienced
Staff Backend Engineer to lead the engineering efforts behind our homegrown platform for serving and operating production-grade AI models and AI-based algorithms. This is a mission-critical role for someone passionate about building highly scalable, GPU-aware, cloud-native systems that act as the connective tissue between algorithm research and product innovation. You will play a pivotal part in redesigning and evolving the platform, while supporting both research and application teams across the organization and contributing to MLOps initiatives.
Key Responsibilities
Platform Ownership
- Own the architecture, stability, scalability, and performance of the system
- Design and implement platform features that support both synchronous low-latency and asynchronous compute-heavy algorithm execution
- Enhance GPU management, scheduling, and resource allocation for optimal performance and cost-efficiency
- Ensure robust Kubernetes-based deployment and observability for a highly dynamic system
Cross-Team Collaboration
- Act as the technical bridge between Research and Application teams by translating requirements into scalable system designs
- Collaborate closely with algorithm developers to streamline model deployment processes
- Partner with backend engineers (primarily working in Ruby and Go) to integrate the Research Group's algorithms into Cloudinary services
Engineering Excellence
- Advocate for high standards in code quality, observability, testing, and security
- Guide integration efforts for engineering teams consuming the platform's APIs
- Provide mentorship, support, and best practices to other engineers interacting with the platform
- Take part in general R&D efforts that support the broader production environment
Platform Extension and MLOps
- Contribute to the evolution of the MMS platform to support a wider range of algorithmic workloads and model types
- Help shape tooling and infrastructure for model versioning, rollout, monitoring, and testing
- Collaborate with DevOps and Infrastructure teams to maintain operational excellence, system observability, and robust infrastructure support
Your Qualifications
- 8+ years of experience in software engineering, with 3+ years working on infrastructure/platforms involving ML/AI, GPU, or data-heavy systems
- Proficiency in Python and familiarity with backend languages such as Ruby and/or Go
- Strong understanding of Kubernetes internals and experience running GPU workloads in production environments
- In-depth knowledge of AWS services
- Experience architecting systems that support both real-time and asynchronous processing pipelines
- Familiarity with the ML lifecycle and MLOps practices, including CI/CD for models, monitoring, and rollback strategies
Bonus Qualifications
- Experience working in research-driven environments or alongside data scientists, algorithm researchers, and ML engineers
- Contributions to open-source projects related to model serving, Kubernetes operators, or ML platforms
- Experience supporting systems with diverse user groups across engineering and research disciplines
Why Join Us?
- Opportunity to build and scale a one-of-a-kind platform powering state-of-the-art media algorithms
- Collaborate with world-class research, engineering, and product teams
- Have a direct impact on product experiences used by millions of developers and end-users
- Be part of a culture that values creativity, autonomy, and continuous improvement