DevJobs

AI-OPS Team Member | IT Infrastructure

Skills
  • Python
  • Bash
  • Linux
  • Jenkins
  • GitHub Actions
  • Azure
  • AWS
  • GCP
  • Docker
  • Kubernetes
  • Ansible
  • Terraform
  • Networking
  • Grafana
  • Infrastructure security
  • Kubeflow
  • MLflow
  • Vector databases
  • Distributed systems
  • GitLab CI
  • GPU management
  • Prometheus
  • AI infrastructure vendor solutions

We are seeking a skilled and motivated AI-OPS Team Member to join our IT Infrastructure – Tools & Collaboration department.

This is an exciting opportunity to be part of a newly established AI-OPS team, responsible for centralizing and optimizing the management of our organization’s AI-related infrastructure across all environments.

About the Role

As an AI-OPS Team Member, you will play a key role in managing and maintaining our AI infrastructure, both on-premises and in the cloud. You will ensure the availability, performance, scalability, and security of the AI platforms, tools, and hardware resources that support our enterprise AI initiatives.


Key Responsibilities

  • Operate and maintain AI infrastructure with a focus on stability, performance, and scalability.
  • Deploy, configure, and troubleshoot AI platforms, containerized environments, and GPU-based workloads.
  • Monitor and optimize resource utilization (CPU, GPU, memory, storage, network).
  • Contribute to CI/CD processes and implement automation using Infrastructure as Code (IaC).
  • Manage security, access controls, and compliance across AI systems.
  • Collaborate with AI Platform Engineers and AI Security Engineers to resolve incidents and improve reliability.
  • Maintain documentation, runbooks, and operational best practices.
  • Continuously identify opportunities to improve efficiency and cost-effectiveness in AI infrastructure management.


Required Qualifications

  • Proven experience in IT infrastructure operations and management.
  • Hands-on experience with AI/ML platforms (e.g., MLflow, Kubeflow).
  • Proficiency with CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions).
  • Strong knowledge of cloud infrastructure (Azure preferred; otherwise AWS or GCP).
  • Expertise in Docker and Kubernetes, especially for AI workloads.
  • Experience with GPU management and optimization.
  • Proficiency in Infrastructure as Code (Terraform, Ansible).
  • Solid Linux administration skills.
  • Familiarity with Prometheus, Grafana, and similar monitoring tools.
  • Scripting ability in Python and Bash.
  • Strong understanding of networking, distributed systems, and infrastructure security.
  • Experience managing or deploying vector databases.
  • Familiarity with AI infrastructure vendor solutions.
  • Experience in infrastructure centralization projects.
  • Relevant cloud or container certifications.

Unilink Ltd.