MLOps Engineer - AI Infra Group

Skills
  • Elasticsearch
  • Kubernetes
  • Grafana
  • Terraform
  • ArgoCD
  • GitOps
  • Prometheus
  • MLflow
  • NVIDIA GPU Operator
At Dream, we redefine cyber defense by combining AI and human expertise to create products that protect nations and critical infrastructure. This is more than a job; it’s a Dream job. Dream is where we tackle real-world challenges, redefine AI and security, and make the digital world safer. Let’s build something extraordinary together.

Dream's AI cybersecurity platform applies a new, out-of-the-ordinary, multi-layered approach, covering endless and evolving security challenges across the entire infrastructure of the most critical and sensitive networks. Central to Dream's proprietary Cyber Language Models are innovative technologies that provide contextual intelligence for the future of cybersecurity.

At Dream, our talented team, driven by passion, expertise, and innovative thinking, inspires us daily. We are not just dreamers; we are dream-makers.

The Dream Job:

We are on an expedition to find you, someone who is passionate about creating intuitive, out-of-this-world production-grade AI infrastructure. This group builds scalable, high-performance AI systems for internal users and external customers, designed to run seamlessly across cloud and on-premise environments using the latest hardware advancements.

The Dream-Maker Responsibilities:

  • Design, build, and maintain scalable Kubernetes-based infrastructure for ML workloads across on-premise and cloud environments
  • Architect hybrid infrastructure solutions enabling seamless model flow from on-premise training environments to cloud-based inference deployments
  • Implement model registry and artifact management strategies that support cross-environment synchronization, versioning, and governance
  • Design secure, efficient data and model transfer mechanisms between on-premise and cloud (networking, storage replication, caching strategies)
  • Implement and manage GPU scheduling, resource allocation, and cluster autoscaling for heterogeneous compute environments
  • Build and maintain CI/CD pipelines for ML systems, including model versioning, testing, and promotion across environments
  • Develop observability solutions (logging, monitoring, alerting) for ML infrastructure across hybrid deployments
  • Collaborate with ML Engineers to define infrastructure requirements and SLAs for training and serving workloads

The Dream Skill Set:

  • 5+ years of experience in infrastructure engineering, platform engineering, or DevOps, preferably supporting ML or data-intensive workloads
  • Experience designing and operating hybrid cloud architectures (on-premise + cloud) with a focus on data/model synchronization
  • Familiarity with model registry solutions (MLflow or cloud-native registries) and artifact management at scale
  • Experience with GPU compute infrastructure, device plugins, and resource scheduling (e.g., NVIDIA GPU Operator)
  • Proficiency in IaC tools (Terraform) and GitOps practices (ArgoCD)
  • Experience with monitoring and observability stacks (Prometheus, Grafana, ELK)
  • Familiarity with ML workflows, sufficient to understand workload characteristics and requirements

Never Stop Dreaming:

If you think this role doesn't fully match your skills but you are eager to grow and break glass ceilings, we’d love to hear from you!
Dream Security