DevJobs

ML Infrastructure Engineer

Skills
  • Python ꞏ 3y
  • PyTorch
  • CI/CD
  • GCP
  • Docker
  • Terraform
  • Weights & Biases
  • Vertex AI
  • Spot scheduling
  • MLflow
  • Infrastructure-as-Code
  • GPU instance management
  • GCS
  • GCE
  • FastAPI
  • Model monitoring
  • ONNX
  • Pulumi
  • ETL
  • DVC
  • Streaming data pipelines
  • Drift detection
  • TorchScript
  • Triton
  • Data lakes
  • Canary deployment

Overview

We’re looking for an ML Infrastructure Engineer to own the full ML pipeline that powers Psistar’s AI products – from data acquisition through model serving. You’ll be the critical link between our research team and our production engineering stack, making it fast and frictionless for researchers to go from idea to experiment, and for engineering to consume trained models in production.

The emphasis here is on infrastructure and tooling rather than model development. While you’ll work closely with our research team and need a strong understanding of ML internals, your primary impact will be in building the systems and pipelines that accelerate the entire team.


Responsibilities

MLOps & Training Infrastructure

  • Own the training infrastructure that gets researchers from idea to running experiment with minimal friction
  • Keep our compute fast, reproducible, and cost-efficient across managed and spot infrastructure
  • Build the experiment tracking and management story so researchers can launch, compare, and reproduce work without thinking about ops

Data Pipelines

  • Build the data layer that turns heterogeneous industrial data – sensor streams, engineering documents, schematics – into something our models can train on
  • Make large-scale training data tractable, including the pipelines needed for foundation-model-scale datasets
  • Own how datasets are versioned, curated, and reused across the team

Model Serving & Registry

  • Own the path from a trained checkpoint to a deployed, monitored model, including the model registry and the interface between research and product
  • Build a serving runtime that performs in production and runs anywhere our customers operate, including air-gapped environments
  • Own the safety and observability of model rollouts in the field

DevOps & Platform

  • Own the platform our ML stack runs on – infrastructure-as-code, CI/CD, and container management
  • Build the observability story so we catch issues before our customers do


Requirements

  • 3+ years in ML engineering, MLOps, data engineering, or ML infrastructure
  • Strong Python – production-grade code, not notebooks
  • Hands-on experience building and maintaining ML training and serving pipelines
  • Hands-on cloud ML infrastructure experience (GCP preferred – Vertex AI, GCE, GCS) including GPU instance management and spot scheduling
  • Comfortable with infrastructure-as-code and CI/CD for ML workloads
  • Familiarity with PyTorch at the infrastructure level – training loops, data loading, checkpointing, and export/serving mechanics
  • Experience with experiment tracking tools (MLflow, Weights & Biases, or similar)
  • Familiarity with FastAPI or similar serving frameworks
  • Strong sense of ownership


Nice to Have

  • Experience building model registries or Hugging Face-compatible model interfaces
  • Background in data engineering (ETL, DVC, data lakes)
  • Container and model packaging experience (Docker, ONNX, TorchScript, Triton)
  • Terraform, Pulumi, or equivalent IaC tooling
  • Streaming data pipelines for large-scale training
  • Exposure to industrial/engineering data (P&IDs, sensor data, schematics)
  • On-prem or air-gapped ML deployment experience
  • Model monitoring, drift detection, or canary deployment patterns
  • Experience bridging research and production ML teams
Maverick AI