AI-OPS Team Member | IT Infrastructure 
We are seeking a skilled and motivated AI-OPS Team Member to join our IT Infrastructure – Tools & Collaboration department.
This is an exciting opportunity to be part of a newly established AI-OPS team, responsible for centralizing and optimizing the management of our organization’s AI-related infrastructure across all environments.
About the Role
As an AI-OPS Team Member, you will play a key role in managing and maintaining our AI infrastructure, both on-premises and in the cloud. You will ensure the availability, performance, scalability, and security of the AI platforms, tools, and hardware resources that support our enterprise AI initiatives.
Key Responsibilities
- Operate and maintain AI infrastructure with a focus on stability, performance, and scalability.
- Deploy, configure, and troubleshoot AI platforms, containerized environments, and GPU-based workloads.
- Monitor and optimize resource utilization (CPU, GPU, memory, storage, network).
- Contribute to CI/CD processes and implement automation using Infrastructure as Code (IaC).
- Manage security, access controls, and compliance across AI systems.
- Collaborate with AI Platform Engineers and AI Security Engineers to resolve incidents and improve reliability.
- Maintain documentation, runbooks, and operational best practices.
- Continuously identify opportunities to improve efficiency and cost-effectiveness in AI infrastructure management.
 
Required Qualifications
- Proven experience in IT infrastructure operations and management.
- Hands-on experience with AI/ML platforms (e.g., MLflow, Kubeflow).
- Proficiency in CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions).
- Strong knowledge of cloud infrastructure (Azure preferred; AWS or GCP also acceptable).
- Expertise in Docker and Kubernetes, especially for AI workloads.
- Experience with GPU management and optimization.
- Proficiency in Infrastructure as Code (Terraform, Ansible).
- Solid Linux administration skills.
- Familiarity with Prometheus, Grafana, and similar monitoring tools.
- Scripting ability in Python and Bash.
- Strong understanding of networking, distributed systems, and infrastructure security.
- Experience managing or deploying vector databases.
- Familiarity with AI infrastructure vendor solutions.
- Experience in infrastructure centralization projects.
- Relevant cloud or container certifications.