This role is a member of the Google Cloud Engineering team and will be responsible for implementing and supporting core company infrastructure and dependent tenants in Google Cloud Platform. You will work directly with infrastructure teams and may collaborate with data scientists, machine learning engineers, portfolio managers, application developers, and quantitative analysts, functioning as both a solutions architect and a professional services engineer.
This is a hands-on developer role requiring experience deploying and supporting production-ready code in cloud environments and automating the build and management of cloud infrastructure, preferably in Python. Candidates should be familiar with developing unit and functional tests and with designing and implementing CI/CD pipelines using infrastructure as code, and should have knowledge of Linux systems administration, containerization, networking, security, automated configuration and state management, cross-system orchestration, logging, metrics, monitoring, and alerting.
In addition to these responsibilities, the ideal candidate will work on agentic design patterns, optimize agents’ response quality, integrate external MLOps tools, deploy AI/ML models in cloud environments, support existing ML/AI tools as needed, and perform other MLOps-related tasks based on organizational needs.
Our Israel office is located in the Bursa area of Ramat Gan.
This role will be on-site.
As we are a global firm, proficiency in English is required.
Principal Responsibilities
- Lead the design and implementation of end-to-end cloud solutions that are scalable, reliable, and aligned with organizational goals
- Work on integration of AI tools and systems within the cloud environment
- Lead infrastructure initiatives across their lifecycle, ensuring seamless integration and optimal performance
- Monitor and improve the quality of responses generated by agentic systems and AI tools in production environments
- Architect and maintain internal and customer-facing cloud and/or AI solutions, serving as a Subject Matter Expert in Google Cloud Platform infrastructure
- Execute the technical roadmap to shape the technology stack and drive innovation
- Implement and manage CI/CD pipelines, monitoring systems, and automation tools to streamline operations and enhance service uptime
- Strengthen system security through audits, best practices, and robust measures
- Manage multiple deployment environments, ensuring consistency and reliability across them
- Provide insights to improve products from a user-centric perspective
Required Qualifications/Skills
- B.Sc. in Computer Science or another quantitative field
- At least 4 years of experience in DevOps/MLOps, including designing and supporting production cloud environments, with a hands-on approach to problem-solving and a deep enthusiasm for technology
- Extensive knowledge of cloud infrastructure, including scalable, multi-service production architectures and complex network systems
- Proven experience in programming with Python, Go, or Java
- Strong background in CI/CD workflows, ideally with tools like GitLab CI or Jenkins, and experience implementing observability solutions and monitoring systems to ensure reliability and performance
- Proficient in configuration management and Infrastructure-as-Code tools, such as Ansible, Terraform, and Helm
- Expertise in Linux administration, system internals, and network troubleshooting, with knowledge of cloud networking (connectivity, routing, DNS, VPCs, proxies, and load balancers)
- Strong understanding of cloud security principles
- Experience consulting with customers to develop public cloud solutions and collaboratively developing infrastructure as code
- Proven ability to work effectively in team environments, fostering inclusivity, collaboration, and cooperation
- Excellent written and verbal communication skills, along with strong troubleshooting and analytical abilities
- Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch), distributed computing tools (e.g., Ray, Dask), model serving tools (e.g., vLLM, KFServing), and deploying production-ready AI/ML models in cloud environments (Advantage)
- Experience with MLOps tooling and concepts, such as experiment tracking (MLflow), model serving, feature stores, or pipeline orchestration (Kubeflow, Vertex AI, SageMaker) (Advantage)
- Experience building applications on top of LLMs using frameworks like LangChain or LlamaIndex and knowledge of patterns like RAG (Advantage)
- Experience with containerized workloads and Kubernetes orchestration (Advantage)