DevJobs

Manager, Software Engineering - AIOps

Overview
Skills
  • Python
  • Go
  • CI/CD
  • Docker
  • Kubernetes
  • Networking
  • Solution Design ꞏ 3y
  • Distributed Systems ꞏ 3y
  • Autonomous systems
  • SaaS
  • NVIDIA hardware stack
  • ML models
  • LangChain
  • GPUs
  • Ethernet switching
  • DPUs
  • Closed-loop feedback monitoring tools
  • AutoGPT
  • IaaS
  • PaaS
NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms.

This role focuses on leveraging AI Agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.

What You’ll Be Doing

  • Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.
  • Innovative AIOps Engineering: Lead the development of tools and proofs of concept focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.
  • Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.
  • Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.
  • Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.
  • Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.

What We Need To See

  • BS/MS degree in Computer Science or a related technical field (or equivalent experience).
  • 8+ years of overall software engineering experience, with at least 2 years in a management or technical lead role.
  • Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in Solution Design and Distributed Systems.
  • Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.
  • Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.
  • Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.
  • Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.

Ways To Stand Out From The Crowd

  • AI/ML Background: Experience building or deploying AI Agents (LangChain, AutoGPT) or using ML models for anomaly detection and predictive analytics.
  • Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).
  • Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.
  • SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.

JR2017429

NVIDIA