NVIDIA is at the forefront of the AI revolution, and the AIOps department is critical to ensuring our AI-driven data centers operate with unmatched efficiency. We are looking for a visionary, hands-on Software Engineering Manager to lead a team building the next generation of AI-based monitoring and operation platforms.
This role focuses on leveraging AI agents to automate, predict, and optimize data center performance at internet scale. If you are a resilient leader who excels in fast-paced environments and has a passion for autonomous system operations, we want you on our team.
What You’ll Be Doing
- Strategic Roadmap Development: Define software design and implementation roadmaps for AI-driven operations, ensuring data center availability, resiliency, and performance through autonomous agent-based monitoring.
- Innovative AIOps Engineering: Lead the development of tools and proof-of-concepts focused on software-defined operations, utilizing AI agents to automate root cause analysis and proactive remediation.
- Scalable Architecture: Build and scale monitoring applications that handle massive telemetry data from AI infrastructure across public, private, and hybrid cloud environments.
- Agentic Frameworks: Oversee the integration of LLM-based agents into CI/CD and operational workflows to shift from reactive monitoring to predictive orchestration.
- Team Leadership: Actively hire, mentor, and grow a high-performing engineering team, fostering a culture of technical excellence and creative problem-solving.
- Customer Engagement: Directly contribute to internal and external customer engagements to align AIOps solutions with real-world data center challenges.
What We Need To See
- BS/MS degree in Computer Science or a related technical field (or equivalent experience).
- 8+ years of overall software engineering experience, including at least 2 years in a management or technical lead role.
- Domain Expertise: 3+ years of experience in system software engineering for large-scale production systems, with a strong background in solution design and distributed systems.
- Cloud Native Mastery: Deep experience with Docker and Kubernetes orchestration, alongside PaaS or IaaS cloud platforms.
- Programming Proficiency: Strong programming skills in Python (essential for AI/ML workflows) and Go.
- Operational Intelligence: Extensive knowledge of CI/CD pipelines and automated software-defined operations.
- Exceptional written and verbal communication skills to bridge the gap between complex AI logic and operational requirements.
Ways To Stand Out From The Crowd
- AI/ML Background: Experience building or deploying AI agents (e.g., LangChain, AutoGPT) or using ML models for anomaly detection and predictive analytics.
- Infrastructure Knowledge: Familiarity with Ethernet switching, networking protocols, or NVIDIA’s hardware stack (GPUs/DPUs).
- Control Systems: Experience in developing autonomous systems or closed-loop feedback monitoring tools.
- SaaS Background: Proven track record of managing and scaling cloud-based SaaS applications.
JR2017429